Machine Learning Essentials

Chief AI Officer Program at Costello College of Business (GMU)

Vadim Sokolov

George Mason University

today

Brief History of AI

Mechanical machines

Robots and Automatic Machines Were Generally Very Inventive: Al-Jazari (XII Century)

Hesdin Castle (Robert II of Artois), Leonardo’s robot…

Mechanical machines

Jaquet-Droz automata (XVIII century):

Mechanical machines

  • But this is in mechanics, in mathematics/logic AI it was quite rudimentary for a long time

Logic machine of Ramon Llull (XIII-XIV centuries)

  • Starting with Dr. Frankenstein, further AI in the literature appears constantly …

Shennon’s Theseus

  • YouTube Video
  • Early 1950s, Claude Shannon (The father of Information Theory) demonstrates Theseus
  • A life-sized magnetic mouse controlled by relay circuits, learns its way around a maze.

1956-1960: Great hopes

  • Optimistic time. It seemed a that we were almost there…
  • Allen Newell, Herbert A. Simon, and Cliff Shaw: Logic Theorist.
  • Automated reasoning.
  • It was able to prof most of the Principia Mathematica, in some places even more elegant than Russell and Whitehead.

1956-1960: Big Hopes

  • General Problem Solver - a program that tried to think as a person
  • A lot of programs that have been able to do some limited things (MicroWorlds):
    • Analogy (IQ tests with multiple choice questions)
    • Student (algebraic verbal tasks)
    • Blocks World (rearranged 3D blocks).

1970s: Knowledge Based Systems

  • The bottom line: to accumulate a fairly large set of rules and knowledge about the subject area, then draw conclusions.
  • First success: MYCIN - Diagnosis of blood infections:
    • about 450 rules
    • The results are like an experienced doctor and significantly better than beginner doctors.

1980-2010: Commercial applications Industry AI

  • The first AI department was at Dec (Digital Equipment Corporation). It is argued that by 1986 he saved the Dec about $10 million per year.
  • The boom ended by the end of the 80s, when many companies could not live up to high expectations.

Rule-Based System vs Bayes

Old AI


If rain outside, then take umbrella

This rule cannot be learned from data. It does not allow inference. Cannot say anything about rain outside if I see an umbrella.


 

New AI

Probability of taking umbrella, given there is rain

Conditional probability rule can be learned from data. Allows for inference. We can calculate the probability of rain outside if we see an umbrella.

  • Bayesian approach is a powerful statistical framework based on the work of Thomas Bayes and later Laplace.
  • It provides a probabilistic approach to reasoning and learning
  • Allowing us to update our beliefs about the world as we gather new data.
  • This makes it a natural fit for artificial intelligence, where we often need to deal with uncertainty and incomplete information.

DEFINITION

  • How to determine “learning”?

Definition:

The computer program learns as the data is accumulating relative to a certain problem class \(T\) and the target function of \(P\) if the quality of solving these problems (relative to \(P\)) improves with gaining new experience.

  • The definition is very (too?) General.
  • What specific examples can be given?

Tasks and concepts of ML

Tasks and concepts of ML: Supervised Learning

  • training sample – a set of examples, each of which consists of input features (attributes) and the correct “answers” - the response variable
  • Learn a rule that maps input features to the response variable
  • Then this rule is applied to new examples (deployment)
  • The main thing is to train a model that explains not only examples from the training set, but also new examples (generalizes)
  • Otherwise - overfitting

Tasks and concepts of ML: unsupervised learning

There are no correct answers, only data, e.g. clustering:

  • We need to divide the data into pre -unknown classes to some extent similar:
    • highlight the family of genes from the sequences of nucleotides
    • cluster users and personalize the application for them
    • cluster the mass spectrometric image to parts with different composition

Tasks and concepts of ML: unsupervised learning

  • Dimensionality reduction: data have a high dimension, it is necessary to reduce it, select the most informative features so that all of the above algorithms can work
  • Matrix Competition: There is a sparse matrix, we must predict what is in the missing positions.
  • Anomaly detection: find anomalies in the data, e.g. fraud detection.
  • Often the outputs answers are given for a small part of the data, then we call it semi -supervised Learning.

Tasks and concepts of ML: reinforcement learning

  • Multi-armed bandits: there is a certain set of actions, each of which leads to random results, you need to get as much rewards possible
  • Exploration vs.Exploitation: how and when to proceed from the study of the new to use what has already studied
  • Credit Assignment: You get rewarded at the very end (won the game), and we must somehow distribute this reward on all the moves that led to victory.

Tasks and concepts of ML: active learning

  • Active Learning - how to choose the following (relatively expensive) test
  • Boosting - how to combine several weak classifiers so that it turns out good
  • Model Selection - where to draw a line between models with many parameters and with a few.
  • Ranking: response list is ordered (internet search)

Tasks and concepts of AI

Tasks and concepts of AI: Reasoning

  • Bayesian networks: given conditional probabilities, calculate the probability of the event
  • o1 by OpenAI: a family of AI models that are designed to perform complex reasoning tasks, such as math, coding, and science. o1 models placed among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME)
  • Gemini 2.0: model for the agentic era

Tasks and concepts of AI: Representation

  • Knowledge Graphs: a graph database that uses semantic relationships to represent knowledge
  • Embeddings: a way to represent data in a lower-dimensional space
  • Transformers: a deep learning model that uses self-attention to process sequential data

Tasks and concepts of AI: Generation

In shadows of data, uncertainty reigns,
Bayesian whispers, where knowledge remains.
With prior beliefs, we start our quest,
Updating with evidence, we strive for the best.

A dance of the models, predictions unfold,
Inferences drawn, from the new and the old.
Through probabilities, we find our way,
In the world of AI, it’s the Bayesian sway.

So gather your data, let prior thoughts flow,
In the realm of the unknown, let your insights grow.
For in this approach, with each little clue,
We weave understanding, both rich and true.

Music

Tasks and concepts of AI: Generation

from openai import OpenAI
client = OpenAI(api_key="your-api-key")
response = client.images.generate(
    model="dall-e-3",
    prompt="a hockey player trying to understand the Bayes rule",
    size="1024x1024",
    quality="standard",
    n=1,
)

print(response.data[0].url)

Tasks and concepts of AI: Generation

A humorous and illustrative scene of a hockey player sitting on a bench in full gear, holding a hockey stick in one hand and a whiteboard marker in th

Chess and AI

Old AI: Deep Blue (1997) vs. Garry Kasparov

Kasparov vs IBM’s DeepBlue in 1997

AlphaGo Zero

  • Remove all human knowledge from training process - only uses self play,
  • Takes raw board as input and neural network predicts the next move.
  • Uses Monte Carlo tree search to evaluate the position.
  • The algorithm was able to beat AlphaGo 100-0. The algorithm was then used to play chess and shogi and was able to beat the best human players in those games as well.

Alpha GO vs Lee Sedol: Move 37 by AlphaGo in Game Two

Probability in machine learning

  • In all methods and approaches, it is useful not only generate an answer, but also evaluate how confident in this answer, how well the model describes the data, how these values will change in further experiments, etc.
  • Therefore, the central role in machine learning is played by the theory of probability - and we will also actively use it.

Bayes Approach

Review of Basic Probability Concepts

Probability lets us talk efficiently about things that we are uncertain about.

  • What will Amazon’s sales be next quarter?
  • What will the return be on my stocks next year?
  • How often will users click on a particular Google ad?

All these involve estimating or predicting unknowns!!

Random Variables

Random Variables are numbers that we are not sure about. There’s a list of potential outcomes. We assign probabilities to each outcome.

Example: Suppose that we are about to toss two coins. Let \(X\) denote the number of heads. We call \(X\) the random variable that stands for the potential outcome.

Probability

Probability is a language designed to help us communicate about uncertainty. We assign a number between \(0\) and \(1\) measuring how likely that event is to occur. It’s immensely useful, and there’s only a few basic rules.

  1. If an event \(A\) is certain to occur, it has probability \(1\), denoted \(P(A)=1\)
  2. Either an event \(A\) occurs or it does not. \[P(A) = 1 - P(\text{not }A)\]
  3. If two events are mutually exclusive (both cannot occur simultaneously) then \[P(A \text{ or } B) = P(A) + P(B)\]
  4. Joint probability, when events are independent \[P(A \text{ and } B) = P( A) P(B)\]

Probability Distribution

We describe the behavior of random variables with a Probability Distribution

Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads.

\[X = \left\{ \begin{array}{ll} 0 \text{ with prob. } 1/4\\ 1 \text{ with prob. } 1/2\\ 2 \text{ with prob. } 1/4 \end{array} \right.\]

\(X\) is called a Discrete Random Variable

Question: What is \(P(X=0)\)? How about \(P(X \geq 1)\)?

Example: Happiness Index

“happiness index” as a function of salary.

Salary (\(X\)) Happiness (\(Y\)): 0 (low) 1 (medium) 2 (high)
low 0 0.03 0.12 0.07
medium 1 0.02 0.13 0.11
high 2 0.01 0.13 0.14
very high 3 0.01 0.09 0.14

Is \(P(Y=2 \mid X=3) > P(Y=2)\)?

Bayes Rule

The computation of \(P(x \mid y)\) from \(P(x)\) and \(P(y \mid x)\) is called Bayes theorem: \[ P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{P(y\mid x)p(x)}{p(y)} \]

This shows now the conditional distribution is related to the joint and marginal distributions.

You’ll be given all the quantities on the r.h.s.

Bayes Rule

Key fact: \(P(x \mid y)\) is generally different from \(P(y \mid x)\)!

Example: Most people would agree

\[\begin{align*} Pr & \left ( Practice \; hard \mid Play \; in \; NBA \right ) \approx 1\\ Pr & \left ( Play \; in \; NBA \mid Practice \; hard \right ) \approx 0 \end{align*}\]

The main reason for the difference is that \(P( Play \; in \; NBA ) \approx 0\).

Independence

Two random variable \(X\) and \(Y\) are independent if \[ P(Y = y \mid X = x) = P (Y = y) \] for all possible \(x\) and \(y\) values. Knowing \(X=x\) tells you nothing about \(Y\)!

Example: Tossing a coin twice. What’s the probability of getting \(H\) in the second toss given we saw a \(T\) in the first one?

Sally Clark Case: Independence or Bayes?

Sally Clark was accused and convicted of killing her two children

They could have both died of SIDS.

  • The chance of a family which are non-smokers and over 25 having a SIDS death is around 1 in 8,500.

  • The chance of a family which has already had a SIDS death having a second is around 1 in 100.

  • The chance of a mother killing her two children is around 1 in 1,000,000.

Bayes or Independence

  1. Under Bayes \[\begin{align*} P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) & = P \left( \mathrm{first} \; \mathrm{SIDS} \right) P \left( \mathrm{Second} \; \; \mathrm{SIDS} | \mathrm{first} \; \mathrm{SIDS} \right) \\ & = \frac{1}{8500} \cdot \frac{1}{100} = \frac{1}{850,000} \end{align*}\]

The \(\frac{1}{100}\) comes from taking into account genetics.

  1. Independence, as the court did, gets you

\[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) (1/8500) = (1/73,000,000) \]

  1. By Bayes rule

\[ \frac{p(I|E)}{p(G|E)} = \frac{P( E \cap I)}{P( E \cap G)} \] \(P( E \cap I) = P(E|I )P(I)\) needs discussion of \(p(I)\).

Random Variables: Expectation \(E(X)\)

The expected value of a random variable is simply a weighted average of the possible values X can assume.

The weights are the probabilities of occurrence of those values.

\[E(X) = \sum_x xP(X=x)\]

With \(n\) equally likely outcomes with values \(x_1, \ldots, x_n\), \(P(X = x_i) = 1/n\)

\[E(X) = \frac{x_1+x_2+\ldots+x_n}{n}\]

Roulette Expectation

  • European Odds: 36 numbers (red/black) + zero
  • You bet $1 on 11 Black (pays 35 to 1)
  • \(X\) is the return on this bet

\[E(X) = \frac{1}{37}\times 36 + \frac{36}{37}\times 0 = 0.97\]

  • If you bet $1 on Black (pays 1 to 1)

\[E(X) = \frac{18}{37}\times 2 + \frac{19}{37}\times 0 = 0.97\]

Casino is guaranteed to make money in the long run!

Standard Deviation \(sd(X)\) and Variance \(Var(X)\)

The variance is calculated as

\[Var(X) = E\left((X - E(X))^2\right)\]

A simpler calculation is \(Var(X) = E(X^2) - E(X)^2\).

The standard deviation is the square-root of variance.

\[sd(X) = \sqrt{Var(X)}\]

Roulette Variance

  • European Odds: 36 numbers (red/black) + zero
  • You bet $1 on 11 Black (pays 35 to 1)
  • \(X\) is the return on this bet

\[Var(X) = \frac{1}{37}\times (36 - 0.97)^2 + \frac{36}{37}\times (0 - 0.97)^2 = 34\]

  • If you bet $1 on Black (pays 1 to 1)

\[Var(X) = \frac{18}{37}\times (2 - 0.97)^2+ \frac{19}{37}\times (0- 0.97)^2 = 1\]

If your goal is to spend as much time as possible in the casino (free drinks): place small bets on black/red

Example: \(E(X)\) and \(Var(X)\)

Tortoise and Hare are selling cars. Probability distributions, means and variances for \(X\), the number of cars sold

0 1 2 3 Mean Variance sd
cars sold \(X\) \(E(X)\) \(Var(X)\) \(\sqrt{Var(X)}\)
Tortoise 0 0.5 0.5 0 1.5 0.25 0.5
Hare 0.5 0 0 0.5 1.5 2.25 1.5

Expectation and Variance Calculations

Let’s do Tortoise expectations and variances

  • The Tortoise \[\begin{align*} E(T) &= (1/2)(1) + (1/2)(2) = 1.5 \\ Var(T) &= E(T^2) - E(T)^2 \\ &= (1/2)(1)^2 + (1/2)(2)^2 - (1.5)^2 = 0.25 \end{align*}\]

  • Now the Hare’s \[\begin{align*} E(H) &= (1/2)(0) + (1/2)(3) = 1.5 \\ Var(H) &= (1/2)(0)^2 + (1/2)(3)^2- (1.5)^2 = 2.25 \end{align*}\]

Expectation and Variance Interpretation

What do these tell us above the long run behavior?

  • Tortoise and Hare have the same expected number of cars sold.
  • Tortoise is more predictable than Hare. He has a smaller variance The standard deviations \(\sqrt{Var(X)}\) are \(0.5\) and \(1.5\), respectively
  • Given two equal means, you always want to pick the lower variance.

Linear Combinations of Random Variables

Two key properties:

Let \(a, b\) be given constants

  • Expectations and Variances \[\begin{align*} E(aX + bY) &= a E(X) + b E(Y) \\ Var(aX + bY) &= a^2 Var(X) + b^2 Var(Y) + 2 ab Cov(X,Y) \end{align*}\]

where \(Cov(X,Y)\) is the covariance between random variables.

Tortoise and Hare Portfolio

What about Tortoise and Hare? We need to know \(Cov(\text{Tortoise, Hare})\). Let’s take \(Cov(T,H) = -1\) and see what happens

Suppose \(a = \frac{1}{2}, b= \frac{1}{2}\) Expectation and Variance

\[\begin{align*} E\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{2} E(T) + \frac{1}{2} E(H) = \frac{1}{2} \times 1.5 + \frac{1}{2} \times 1.5 = 1.5 \\ Var\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{4} 0.25 + \frac{1}{4} 2.25 - 2 \frac{1}{2} \frac{1}{2} = 0.625 - 0.5 = 0.125 \end{align*}\]

Much lower!

Bayesian Updating

“Personalization" \(=\)”Conditional Probability"

  • Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.
  • Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.


Many Business Applications!! Suggestions vs Search….

Bayes’s Rule in Medical Diagnostics

Alice is a 40-year-old women, what is the chance that she really has breast cancer when she gets positive mammogram result, given the conditions:

  1. The prevalence of breast cancer among people like Alice is 1%.
  2. The test has an 80% detection rate.
  3. The test has a 10% false-positive rate.

The posterior probability \(P(\text{cancer} \mid \text{positive mammogram})\)?

Medical Diagnostics - Visualization

Medical Screening

Of 1000 cases:

  • 108 positive mammograms. 8 are true positives. The remaining 100 are false positives.
  • 892 negative mammograms. 2 are false negatives. The other 890 are true negatives.

“Personalization” = “Conditional Probability”

Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.

Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.

Many Business Applications!! Suggestions vs Search, ….

How does Netflix Give Recommendations?

Will a subscriber like Saving Private Ryan, given that he or she liked the HBO series Band of Brothers?

Both are epic dramas about the Normandy invasion and its aftermath.

100 people in your database, and every one of them has seen both films.

Their viewing histories come in the form of a big “ratings matrix”.

Liked Band of Brothers Didn’t like it
Liked Saving Private Ryan 56 subscribers 6 subscribers
Didn’t like it 14 subscribers 24 subscribers

\[P(\text{likes Saving Private Ryan} \mid \text{likes Band of Brothers})=\frac{56}{56+14}=80\%\]

How does Netflix Give Recommendations? - Complexity

But real problem is much more complicated:

  1. Scale. It has 100 million subscribers and ratings data on more than 10,000 shows. The ratings matrix has more than a trillion possible entries.
  2. “Missingness”. Most subscribers haven’t watched most films. Moreover, missingness pattern is informative.
  3. Combinatorial explosion. In a database with 10,000 films, no one else’s history is exactly the same as yours.

The solution to all three issues is careful modeling.

How does Netflix Give Recommendations? - Fundamental Equation

The fundamental equation is: \[\text{Predicted Rating} =\text{Overall Average} + \text{Film Offset} + \text{User Offset} + \text{User-Film Interaction}\]

These three terms provide a baseline for a given user/film pair:

  • The overall average rating across all films is 3.7.
  • Every film has its own offset. Popular movies have positive offsets.
  • Every user has an offset. Some users are more or less critical than average.

Netflix - Latent Features

  • The User-Film Interaction is calculated based on a person’s ratings of similar films exhibit patterns because those ratings are all associated with a latent feature of that person.
  • There’s not just one latent feature to describe Netflix subscribers, but dozens or even hundreds. There’s a “British murder mystery” feature, a “gritty character-driven crime drama” feature, a “cooking show” feature, a “hipster comedy films” feature, …

The Hidden Features Tell the Story

  • These latent features are the magic elixir of the digital economy–a special brew of data, algorithms, and human insight that represents the most perfect tool ever conceived for targeted marketing.
  • Your precise combination of latent features–your tiny little corner of a giant multidimensional Euclidean space–makes you a demographic of one.
  • Netflix spent $130 million for 10 episodes on The Crown. Other network television: $400 million commissioning 113 pilots, of which 13 shows made it to a second season.

Probability Distributions

Why Distributions Matter

#| include: false
knitr::knit_hooks$set(pars = function(before, options, envir) {
  if (before) {
    par(mar = c(2.5, 2.5, 0.7, 0.5), bty = "n", cex.lab = 0.7, cex.axis = 0.5, cex.main = 0.5, pch = 20, cex = 1, mgp = c(1.5, 0.4, 0))
  }
})
library(tidyverse)
library(broom)
library(knitr)
  • Machine learning is built on probability
  • Distributions describe uncertainty in data
  • Three fundamental distributions:
    • Binomial: Binary outcomes (yes/no, win/lose)
    • Poisson: Count data (arrivals, events)
    • Normal: Continuous measurements (heights, returns)

Why Should Executives Care?

Business Question Distribution
Will the customer buy? Binomial
How many orders today? Poisson
What’s the forecast error? Normal

Choosing the right distribution is the first step in building a reliable model. Wrong distribution = wrong predictions!

Binomial Distribution

Models the number of successes in \(n\) independent trials, each with probability \(p\)

\[P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}\]

Key Parameters:

  • \(n\) = number of trials
  • \(p\) = probability of success
  • Mean = \(np\)
  • Variance = \(np(1-p)\)

Examples: A/B test conversions, click-through rates, quality defects

#| echo: false
#| fig-height: 5
barplot(dbinom(0:20, size = 20, prob = 0.3),
  names.arg = 0:20, col = "steelblue",
  xlab = "Number of Successes", ylab = "Probability",
  main = "Binomial(n=20, p=0.3)")

NFL Patriots Coin Toss

The Patriots won 19 out of 25 coin tosses in 2014-15. How likely?

  • There are 177,100 ways to arrange 19 wins in 25 games
  • Each specific sequence has probability \(0.5^{25}\)
  • Combined probability: 0.5% or odds of 199 to 1 against
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
# "25 choose 19" = number of ways to pick 19 wins from 25 games
choose(25, 19)

# Probability = (ways to get 19 wins) × (probability of any specific sequence)
choose(25, 19) * 0.5^25

The “Law of Large Numbers” Perspective:

With 32 NFL teams over 20+ years, some team will have a suspicious streak!

Key insight: Probability of Patriots specifically = 0.5%. But probability that some team has a streak ≈ much higher!

Business lesson: When auditing for fraud or anomalies:

  • Don’t just flag rare events
  • Consider how many opportunities for rare events exist
  • Adjust for “multiple comparisons”

Looking at enough data, you’ll always find something “unusual”

Predicting Premier League Goals

How many goals will a team score? Historical EPL data:

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
epl <- read.csv("data/epl.csv")
epl[1:5, c("home_team_name", "away_team_name", "home_score", "guest_score")]

Each row = one match with final scores.

The Business Problem:

Sports betting: $200+ billion industry.

Our approach:

  1. Analyze historical data
  2. Model goals as random events
  3. Estimate team strengths
  4. Simulate matches

Who uses this? FiveThirtyEight, ESPN, DraftKings, Betfair, team analytics

EPL Goals: Mean ≈ Variance

A key signature of Poisson data: the mean equals the variance.

  • Teams score about 1.4 goals per match on average
  • The variance is also ~1.4 — this is the Poisson fingerprint!
  • If variance were much larger, we’d need a different model
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
goals <- c(epl$home_score, epl$guest_score)
mean(goals)  # Average goals per team per match
var(goals)   # Variance ≈ Mean suggests Poisson!

Model Diagnostics: Mean vs Variance

Relationship Suggests
Variance ≈ Mean Poisson ✓
Variance > Mean Overdispersion (Negative Binomial)
Variance < Mean Underdispersion (rare)

Other Poisson Applications:

  • Call center arrivals per hour
  • Website clicks per minute
  • Insurance claims per year
  • Manufacturing defects per batch
  • Emails received per day

Poisson is the “go-to” for count data!

Goals Follow a Poisson Distribution

#| code-fold: true
#| code-summary: "Show R code"
#| fig-height: 5.5
goals <- c(epl$home_score, epl$guest_score)
lambda <- mean(goals)
x <- 0:8
observed <- table(factor(goals, levels = x)) / length(goals)
expected <- dpois(x, lambda = lambda)

barplot(rbind(observed, expected), beside = TRUE, 
        names.arg = x, col = c("steelblue", "coral"),
        xlab = "Goals Scored", ylab = "Proportion",
        legend.text = c("Observed", "Poisson Model"))

Model Validation:

The Poisson model (coral bars) fits the observed data (blue bars) remarkably well!

What this tells us:

  • Goals are indeed rare, independent events
  • The Poisson assumption is justified
  • We can use this model for predictions

Slight discrepancy at 0 goals: Real matches have slightly fewer 0-0 draws than Poisson predicts (teams try harder when level!)

Poisson Distribution

Models count of random events: goals, arrivals, defects, clicks

\[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

  • \(\lambda\) (lambda): expected rate of events
  • Key property: Mean = Variance = \(\lambda\)
  • Events occur independently at a constant average rate

Business Applications:

  • Customer arrivals per hour
  • Website clicks per day
  • Manufacturing defects per batch
  • Insurance claims per year
  • Server requests per minute

If events are rare and independent, Poisson is your model!

Improving the Model: Team Strength

A single \(\lambda\) for all teams is too simple. Better model:

\[\lambda_{ij} = \text{Attack}_i \times \text{Defense}_j \times \text{HomeAdvantage}\]

  • Attack: How good is team \(i\) at scoring?
  • Defense: How weak is team \(j\) at defending?
  • Home advantage: ~0.4 extra goals at home

This is how real sports analytics works:

  1. Estimate each team’s offensive/defensive strength from historical data
  2. Adjust for home/away effects
  3. Predict expected goals for each team
  4. Use Poisson to generate win/draw/loss probabilities

Same framework applies to:

  • NBA point spreads
  • NFL betting lines
  • Cricket run predictions
  • Baseball run expectations

Team-Specific \(\lambda\): Arsenal vs Liverpool

To predict a specific match, we estimate each team’s scoring rate:

  • Arsenal’s attack: How many goals do they typically score at home?
  • Liverpool’s defense: How many goals do they typically concede away?
  • Adjustment: Scale by league average to get relative strength

For Arsenal vs Liverpool at home, we estimate Arsenal will score about 1.8 goals on average. Liverpool’s away \(\lambda\) would be calculated similarly.

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
# Simple estimate: average goals scored and conceded
arsenal_attack <- mean(epl$home_score[epl$home_team_name == "Arsenal"])
liverpool_defense <- mean(epl$home_score[epl$away_team_name == "Liverpool"])
league_avg <- mean(goals)

# Arsenal's expected goals vs Liverpool (simplified)
lambda_arsenal <- arsenal_attack * (liverpool_defense / league_avg)
lambda_arsenal

Monte Carlo Simulation

Once we have \(\lambda\) for each team, we can simulate the match thousands of times.

For Arsenal (\(\lambda=1.8\)) vs Liverpool (\(\lambda=1.5\)), running 10,000 simulations gives:

  • Arsenal wins: ~42% of simulations
  • Draw: ~24% of simulations
  • Liverpool wins: ~34% of simulations

This is how betting companies set their odds!

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
set.seed(42)
n_sims <- 10000
# Simulate Arsenal vs Liverpool
arsenal_goals <- rpois(n_sims, lambda = 1.8)  # λ for Arsenal
liverpool_goals <- rpois(n_sims, lambda = 1.5) # λ for Liverpool

# Match outcomes
c(Arsenal_Win = mean(arsenal_goals > liverpool_goals),
  Draw = mean(arsenal_goals == liverpool_goals),
  Liverpool_Win = mean(arsenal_goals < liverpool_goals))

Why Monte Carlo?

Each simulation draws random goals from Poisson distributions

  • Run 10,000 simulations → get probability of each outcome
  • Can extend to simulate entire season, league standings
  • Same approach used by betting companies and analytics firms

This is how FiveThirtyEight and bookmakers build their models!

Monte Carlo Applications:

  • Finance: Option pricing, portfolio risk (VaR)
  • Insurance: Claim projections, reserve calculations
  • Operations: Supply chain uncertainty, demand forecasting
  • Engineering: Reliability analysis, quality control
  • AI: Reinforcement learning, MCMC for Bayesian inference

When math is too hard, simulate!

Central Limit Theorem (CLT)

The most important theorem in statistics:

The average of many independent random events tends toward a Normal distribution, regardless of the original distribution.

Why it matters: Stock returns, measurement errors, test scores — all tend to be Normal because they’re sums of many small effects.

Practical Implications:

  • Sample means are approximately Normal (even if data isn’t)
  • Confidence intervals work because of CLT
  • A/B testing relies on CLT for significance tests
  • Quality control uses CLT for process monitoring

Rule of thumb: Sample size ≥ 30 usually sufficient for CLT to kick in

This is why the Normal distribution is everywhere!

CLT in Action: Michigan Election Polls

Suppose the true vote share in Michigan is 51%. What happens when we poll voters?

  • Each voter is like a coin flip (vote A or B)
  • Small samples are noisy; large samples converge to the truth
  • The distribution of poll results becomes Normal
#| fig-height: 4.5
#| code-fold: true
#| code-summary: "Show R code"
#| layout-ncol: 3
set.seed(42)
true_p <- 0.51
# Poll of 10 voters
hist(replicate(1000, mean(rbinom(10, 1, true_p))), breaks = 20,
     main = "Poll: 10 Voters", xlab = "Vote Share", col = "steelblue", 
     freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 100 voters
hist(replicate(1000, mean(rbinom(100, 1, true_p))), breaks = 20,
     main = "Poll: 100 Voters", xlab = "Vote Share", col = "steelblue", 
     freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 1000 voters
hist(replicate(1000, mean(rbinom(1000, 1, true_p))), breaks = 20,
     main = "Poll: 1000 Voters", xlab = "Vote Share", col = "steelblue", 
     freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)

Larger samples → tighter Normal distribution around the true value (red line)

Normal Distribution

The “bell curve” — the most important distribution in statistics

The 68-95-99.7 Rule:

  • 68% of data within 1 standard deviation
  • 95% of data within 2 standard deviations
  • 99.7% of data within 3 standard deviations

Why it’s everywhere: Central Limit Theorem guarantees that averages of many random events become Normal

Applications: Quality control, financial risk, test scores, measurement error

#| echo: false
#| fig-height: 5
x <- seq(-4, 4, length = 200)
plot(x, dnorm(x), type = "l", lwd = 3, col = "steelblue",
     xlab = "Standard Deviations from Mean", ylab = "Density")
polygon(c(x[x >= -1 & x <= 1], 1, -1), 
        c(dnorm(x[x >= -1 & x <= 1]), 0, 0), col = rgb(0.3, 0.5, 0.7, 0.3))
abline(v = c(-2, -1, 1, 2), lty = 2, col = "gray")
text(0, 0.15, "68%", cex = 1.2)

Normal: Heights of Adults

Male heights follow a Normal distribution: mean = 70 inches, sd = 3 inches

  • 68% of men are between 67-73 inches (within 1 sd)
  • The 95th percentile is about 75 inches — only 5% are taller
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
# What proportion are between 67 and 73 inches (+/- 1 sd)?
pnorm(73, mean = 70, sd = 3) - pnorm(67, mean = 70, sd = 3)

# What height is taller than 95% of men?
qnorm(0.95, mean = 70, sd = 3)

R Functions for Normal Distribution:

Function Purpose Example
pnorm() Probability ≤ x P(height ≤ 73)
qnorm() Find percentile 95th percentile
dnorm() Density at x Height of curve
rnorm() Random samples Simulate data

Business Applications:

  • Setting size ranges for products
  • Establishing “normal” ranges for KPIs
  • Identifying outliers (> 2-3 sd)
  • Quality control limits

The 1987 Stock Market Crash: A 5-Sigma Event

How extreme was the October 1987 crash of -21.76%?

  • Prior to crash: \(\mu = 1.2\%\), \(\sigma = 4.3\%\) → Z-score = \(\frac{-21.76 - 1.2}{4.3} = -5.34\)
  • Under Normal model: probability = 1 in 20 million (once every 130,000 years!)
  • Yet 5+ sigma events happened in 1987, 2008, and 2020

Conclusion: The model is wrong — stock returns have “fat tails.” Banks using Normal-based VaR dramatically underestimate risk.

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
pnorm(-5.34)  # Probability of -5.34 sigma event

Fat Tails: Reality vs Normal Model

The Problem with Normal Assumptions:

Stock returns have more extreme events than the Normal distribution predicts.

Event Normal Probability Actually Happened
1987 Crash (-22%) 1 in \(10^{160}\) Yes
2008 Crisis “Impossible” Yes
2020 COVID Crash “Impossible” Yes

Implications for Risk Management:

  • VaR models underestimate tail risk
  • Need “fat-tailed” distributions (t-distribution, etc.)
  • Stress testing is essential

Linear Regression

What is Regression?

Finding the relationship between variables

\[y = \beta_0 + \beta_1 x + \epsilon\]

  • \(\beta_0\): intercept (baseline value)
  • \(\beta_1\): slope (change in \(y\) per unit change in \(x\))
  • \(\epsilon\): unexplained variation

Goal: Minimize sum of squared prediction errors

Business Questions Regression Answers:

  • How much does price affect sales?
  • What’s the ROI of advertising spend?
  • How does experience affect salary?
  • What drives customer lifetime value?
  • How does weather affect demand?

Regression quantifies relationships and enables prediction.

Simple Example: House Prices

Using Saratoga County housing data, we fit a model:

Price = f(Living Area)

  • Intercept: Base price of ~$13,000 (land value)
  • Slope: Each additional square foot adds ~$113 to the price

A 2,000 sq ft house: $13K + (2000 × $113) = $239,000

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
d <- read.csv("data/SaratogaHouses.csv")
model <- lm(price ~ livingArea, data = d)
coef(model)

Interpreting Coefficients:

Coefficient Meaning
Intercept ($13K) Value of land without house
Slope ($113/sqft) Price increase per sqft

Making Predictions:

\[\text{Price} = 13,439 + 113 \times \text{SqFt}\]

House Size Predicted Price
1,500 sqft $183,000
2,500 sqft $296,000
3,500 sqft $409,000

Visualizing the Fit

#| echo: false
#| fig-height: 5.5
d$price <- d$price / 1000
d$livingArea <- d$livingArea / 1000
model <- lm(price ~ livingArea, data = d)
plot(d$livingArea, d$price, pch = 21, bg = "lightblue", cex = 0.6,
     xlab = "Living Area (1000 sq ft)", ylab = "Price ($1000)")
abline(model, col = "red", lwd = 3)

What the plot shows:

  • Each blue dot is a house
  • The red line is our prediction
  • Vertical distance from dot to line = prediction error

Key observations:

  • Strong positive relationship
  • More scatter at higher prices (heteroskedasticity)
  • Some outliers (expensive small houses, cheap large houses)

The line minimizes the sum of squared vertical distances

Google vs S&P 500 (CAPM)

The Capital Asset Pricing Model (CAPM) asks: Does a stock follow the market or beat it?

\[\text{Google Return} = \alpha + \beta \times \text{Market Return}\]

  • \(\beta\) (beta): How volatile is the stock relative to the market?
  • \(\alpha\) (alpha): Does the stock outperform after adjusting for risk?
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
#| message: false
library(quantmod)
getSymbols(c("GOOG", "SPY"), from = "2017-01-01", to = "2023-12-31") |> invisible()
goog <- as.numeric(dailyReturn(GOOG))
spy <- as.numeric(dailyReturn(SPY))
model <- lm(goog ~ spy)
print(model)

Google vs S&P 500: CAPM Results

#| echo: false
#| fig-height: 5.5
plot(spy, goog, pch = 20, col = rgb(0.3, 0.5, 0.7, 0.5), cex = 0.8,
     xlab = "S&P 500 Daily Return", ylab = "Google Daily Return",
     main = "Google vs Market (2017-2023)")
abline(model, col = "red", lwd = 3)
abline(h = 0, v = 0, lty = 2, col = "gray")
legend("topleft", legend = bquote(beta == .(round(coef(model)[2], 2))), 
       col = "red", lwd = 3, bty = "n")

Our Findings:

  • Beta (\(\beta = 1.01\)): Google moves 1:1 with market
  • Alpha (\(\alpha \approx 0\)): No significant outperformance (\(p = 0.06\))
Beta       Interpretation
\(\beta < 1\) Less volatile (utilities, healthcare)
\(\beta = 1\) Moves with market (index funds)
\(\beta > 1\) More volatile (tech, small caps)

Conclusion: Google tracked the market without consistent alpha in 2017-2023. High beta = higher risk, potentially higher reward.

Orange Juice: Price & Advertising

How does advertising affect price sensitivity? We model sales as a function of price and whether the product was featured in ads.

Key finding: The interaction term (log(price):feat) is negative and significant — advertising changes how customers respond to price!

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
oj <- read.csv("data/oj.csv")
model <- lm(logmove ~ log(price) * feat, data = oj)
tidy(model) |> select(term, estimate, p.value) |> kable(digits = 3)

The Advertising Paradox

Finding: Advertising increases price sensitivity

Condition Price Elasticity
No advertising -0.96
With advertising -0.96 + (-0.98) = -1.94

Why? Ads coincide with promotions → attract price-sensitive shoppers

Key Lessons:

  1. Correlation ≠ Causation: Ads don’t cause sensitivity; they coincide with promotions

  2. Selection effects: Who responds to ads? Price hunters!

  3. Confounding variables: Promotions happen during ad campaigns

  4. Managerial insight: Don’t blame advertising for price sensitivity — it’s the promotion strategy

Always ask: What’s really driving the relationship?

Logistic Regression

From Regression to Classification

What if the outcome is yes/no?

\[P(y=1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}\]

Why not just use linear regression?

  • Linear regression can predict values < 0 or > 1
  • Probabilities must be between 0 and 1
  • Logistic function “squashes” any input to (0, 1)
#| echo: false
#| fig-height: 5
x <- seq(-6, 6, length = 200)
plot(x, 1/(1 + exp(-x)), type = "l", lwd = 3, col = "steelblue",
     xlab = expression("Linear Predictor (" * beta * "'x)"), ylab = "Probability",
     main = "The Logistic (Sigmoid) Function")
abline(h = 0.5, lty = 2, col = "gray")
abline(h = c(0, 1), lty = 3, col = "red")
text(4, 0.75, "Always between 0 and 1", cex = 0.9)

NBA Point Spread Example

Can Vegas point spreads predict game outcomes? We fit a logistic regression using historical NBA data.

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
NBA <- read.csv("data/NBAspread.csv")
model <- glm(favwin ~ spread - 1, family = binomial, data = NBA)
tidy(model) |> kable(digits = 3)

Interpretation: For each additional point in the spread, log-odds of favorite winning increases by 0.16. The p-value < 0.001 confirms spreads are highly predictive.

Making Predictions

Using our model, we can predict win probability for any point spread:

Spread P(Favorite Wins)
4 points 65%
8 points 78%
12 points 87%
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
predict(model, newdata = data.frame(spread = c(4, 8)), type = "response")

Same approach used for: credit scoring, churn prediction, marketing response, fraud detection — any binary outcome.

Confusion Matrix

How accurate is our model? The confusion matrix shows predictions vs. actual outcomes.

#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
pred <- predict(model, type = "response") > 0.5
table(Actual = NBA$favwin, Predicted = as.integer(pred))

Our model achieves about 66% accuracy — better than a coin flip!

Reading the Matrix:

Pred: 0 Pred: 1
Actual: 0 TN (correct!) FP (oops)
Actual: 1 FN (oops) TP (correct!)

Sports Betting Reality:

  • 66% accuracy sounds good, but…
  • Vegas takes ~10% commission (“vig”)
  • Need ~52.4% accuracy just to break even
  • Edge of 13.6% is excellent if it holds!

But past performance ≠ future results

Understanding the Confusion Matrix

Predicted: Win Predicted: Lose
Actual: Win True Positive (TP) False Negative (FN)
Actual: Lose False Positive (FP) True Negative (TN)

Key Metrics:

  • Accuracy = (TP + TN) / Total — overall correctness
  • Precision = TP / (TP + FP) — “Of predicted wins, how many were right?”
  • Recall = TP / (TP + FN) — “Of actual wins, how many did we catch?”

Caution: Accuracy can mislead! A spam filter predicting “not spam” for everything has 99% accuracy but catches zero spam. Choose metrics based on business costs.

ROC Curve: The Trade-off

#| echo: false
#| fig-height: 5.5
pred_prob <- predict(model, type = "response")
roc_data <- data.frame(
  threshold = seq(0, 1, by = 0.01)
) |>
  rowwise() |>
  mutate(
    sensitivity = mean(pred_prob[NBA$favwin == 1] > threshold),
    specificity = mean(pred_prob[NBA$favwin == 0] <= threshold)
  )

plot(1 - roc_data$specificity, roc_data$sensitivity, type = "l", lwd = 3,
     col = "steelblue", xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC Curve")
abline(0, 1, lty = 2, col = "gray")

Understanding the ROC Curve:

  • X-axis: False Positive Rate (false alarms)
  • Y-axis: True Positive Rate (catches)
  • Diagonal: Random guessing (AUC = 0.5)
  • Upper-left corner: Perfect classifier

Area Under Curve (AUC):

AUC Model Quality
0.5 Random (useless)
0.6-0.7 Poor
0.7-0.8 Fair
0.8-0.9 Good
0.9+ Excellent

Choosing the Right Threshold

The optimal threshold depends on business costs:

  • Fraud detection: Low threshold (catch more fraud, accept false alarms)
  • Medical screening: Low threshold (don’t miss disease)
  • Spam filter: Higher threshold (don’t lose important emails)

There is no universal “correct” threshold

Framework for Threshold Selection:

  1. Quantify costs: What’s the cost of FP vs FN?
  2. Calculate expected cost at each threshold
  3. Choose threshold that minimizes total expected cost

Example — Credit Card Fraud:

  • False Positive cost: $10 (customer inconvenience)
  • False Negative cost: $500 (fraud loss)
  • Optimal threshold: Much lower than 0.5!

Let business economics guide your model decisions

Key Takeaways

Summary

Concept Key Insight
Distributions Binomial (binary), Poisson (counts), Normal (continuous)
Poisson Mean = Variance — the fingerprint of count data
Normal CLT makes it universal for averages
Linear Regression Coefficients = effect sizes
Logistic Regression Outputs probabilities for classification
ROC/AUC Trade-off between false positives and false negatives
Threshold Business costs should drive the choice

Statistics is the science of decision-making under uncertainty

Supplemental Reading

Online Articles:

Key Insight from HBR: A simple A/B test at Bing generated over $100M annually by testing a “low priority” idea

Books for Further Study:

  • The Signal and the Noise — Nate Silver
  • Thinking, Fast and Slow — Daniel Kahneman
  • Naked Statistics — Charles Wheelan
  • Data Science for Business — Provost & Fawcett

Online Courses:

  • Andrew Ng’s Machine Learning (Coursera)
  • Statistical Learning (Stanford Online)
  • Fast.ai Practical Deep Learning

Natural Language Processing

The Language Challenge

“You shall know a word by the company it keeps.” — J.R. Firth (1957)

Language poses unique challenges for AI:

  • Unlike images (continuous pixels) or audio (waveforms), text is discrete symbols
  • The word “cat” is not inherently closer to “dog” than to “quantum”
  • Yet humans effortlessly recognize semantic similarities

The breakthrough: Represent words as vectors in continuous space where geometry encodes meaning.

From Symbols to Vectors

The Problem with One-Hot Encoding:

Each word gets a unique vector with a single 1:

  • “cat” → [0, 0, 1, 0, 0, …, 0]
  • “dog” → [0, 1, 0, 0, 0, …, 0]

Problem: Cosine similarity between any two words = 0

No notion of semantic similarity is captured!

Solution: Learn dense vector representations where similar words are close together.

The Twenty Questions Intuition

Imagine playing Twenty Questions to identify words:

Question Bear Dog Cat
Is it an animal? 1 1 1
Is it domestic? 0 1 0.7
Larger than human? 0.8 0.1 0.01
Has long tail? 0 0.6 1
Is it a predator? 1 0 0.6

Each word becomes a vector of answers. Similar words give similar answers → similar vectors!

This is the essence of word embeddings.

Word2Vec: Learning from Context

The Distributional Hypothesis: Words appearing in similar contexts have similar meanings.

%%| echo: false
%%| fig-width: 12
flowchart LR
    C1["The ___ sat on the mat"]
    C2["The ___ sat on the rug"]
    
    Cat["cat"] --> C1
    Dog["dog"] --> C2
    
    Cat --> V["Similar Vectors!"]
    Dog --> V
    
    style Cat fill:#e1f5fe,stroke:#1976d2
    style Dog fill:#e1f5fe,stroke:#1976d2
    style V fill:#c8e6c9,stroke:#2e7d32

Result: Vector arithmetic captures analogies!

\[\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\]

Word2Vec: War and Peace

Training Word2Vec on Tolstoy’s War and Peace reveals thematic structure:

Word2Vec embeddings from War and Peace, reduced to 2D via PCA

War and Peace: Semantic Clusters

The Word2Vec visualization reveals meaningful semantic relationships:

Cluster Words Insight
Military soldier, regiment, battle, army War domain
Social ballroom, court, marriage Peace domain
Government history, power, war Political themes

Key observation: “Peace” sits between government and social domains — central to the narrative’s dual structure.

Business applications: Netflix recommendations, Amazon suggestions, LinkedIn job matching, document search

The Skip-Gram Model

Given a center word, predict surrounding context words:

%%| echo: false
%%| fig-width: 10
flowchart LR
    A[loves] --> B[the]
    A --> C[man]
    A --> D[his]
    A --> E[son]
    
    style A fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    style B fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style C fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style E fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

\[P(\text{context} \mid \text{center}) = \prod_{j} P(w_{j} \mid w_{\text{center}})\]

The learned vectors capture semantic relationships because words with similar contexts get similar representations.

From Words to Sentences: The Attention Revolution

The Problem: Static Embeddings

  • The word “bank” has one vector, whether it’s “river bank” or “investment bank.”
  • This conflation of senses creates an information bottleneck.

The Sequential Bottleneck (RNNs/LSTMs):

  • Processed text step-by-step (like reading through a straw).
  • Early information “vanished” as sentences grew longer.
  • Impossible to parallelize effectively on modern GPUs.

The Breakthrough: Attention Mechanisms

  • Let each word dynamically “attend” to all other words simultaneously.
  • Example: “The trophy wouldn’t fit in the suitcase because it was too big.”
  • Self-attention identifies that “it” refers to “trophy” by looking at the whole sentence at once.

Result: Contextual representations that change based on surrounding words.

The Attention Mechanism

The Library Analogy (Query, Key, Value):

  • Query (Q): What am I looking for? (e.g., “subject of the sentence”)
  • Key (K): What is in this book? (e.g., “noun”, “verb”, “adjective”)
  • Value (V): What is the content of the book? (the actual meaning vector)

The Mathematical Operation:

  1. Similarity: Compare the Query to all Keys using a dot product.
  2. Scoring: Turn these scores into weights (probabilities) using Softmax.
  3. Retrieval: Take a weighted sum of the Values.

Attention: QKV Interaction

%%| echo: false
%%| fig-width: 10
flowchart LR
    k1[k1]
    k2[k2]
    km[km]
    
    Q[Query q] --> a1[score1]
    Q --> a2[score2]
    Q --> am[scorem]
    
    k1 --> a1
    k2 --> a2
    km --> am
    
    a1 -.-> v1[v1]
    a2 -.-> v2[v2]
    am -.-> vm[vm]
    
    v1 --> O[Output]
    v2 --> O
    vm --> O
    
    style Q fill:#e1f5fe,stroke:#0277bd
    style O fill:#c8e6c9,stroke:#2e7d32

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Attention: A Visual Example

For “The trophy wouldn’t fit in the suitcase because it was too big”:

Word Attention to “it”
trophy 0.45
suitcase 0.15
fit 0.12
big 0.18
other 0.10

The model learns that “it” most likely refers to “trophy” by attending strongly to it!

Self-Attention vs Cross-Attention

Self-Attention:

%%| echo: false
%%| fig-width: 4
flowchart LR
    w1["The"] <--> w2["cat"]
    w2 <--> w3["sat"]
    w1 <--> w3
    
    style w1 fill:#e1f5fe,stroke:#1976d2
    style w2 fill:#e1f5fe,stroke:#1976d2
    style w3 fill:#e1f5fe,stroke:#1976d2

Goal: Understand internal relationships. Used in: Encoders (BERT) and Decoders (GPT). Analogy: Rereading a sentence to find the subject.

Cross-Attention:

%%| echo: false
%%| fig-width: 4
flowchart LR
    e1["The"]
    e2["cat"]
    d1["Le"]
    d2["chat"]
    
    e1 --> d1
    e2 --> d2
    e1 -.-> d2
    e2 -.-> d1
    
    style e1 fill:#e1f5fe,stroke:#1976d2
    style e2 fill:#e1f5fe,stroke:#1976d2
    style d1 fill:#c8e6c9,stroke:#2e7d32
    style d2 fill:#c8e6c9,stroke:#2e7d32

Goal: Link two different sequences. Used in: Translation models (T5). Analogy: Looking back at English while writing French.

The Transformer Architecture

%%| echo: false
%%| fig-width: 12
flowchart LR
    In["Input"] --> Tok["Token"] --> Emb["Embed"] --> Att["Attention"] --> FF["FeedForward"] --> Out["Output"]
    
    style In fill:#e3f2fd,stroke:#1976d2
    style Att fill:#fff3e0,stroke:#f57c00
    style FF fill:#e8f5e9,stroke:#388e3c
    style Out fill:#f3e5f5,stroke:#7b1fa2

Key innovations:

  • Self-attention replaces recurrence → parallel processing
  • Positional encoding preserves word order
  • Multi-head attention captures different relationship types
  • Feed-forward layers add nonlinear transformations

Why Transformers Won

Property RNN/LSTM Transformer
Sequential processing Yes (slow) No (parallel)
Long-range dependencies Difficult Easy
Training speed Slow Fast
Scalability Limited Excellent

Transformers scale with compute → the foundation of modern LLMs.

From Transformers to LLMs

The Scale Approach:

%%| echo: false
%%| fig-width: 9
flowchart LR
    G1["GPT-1<br/>117M"] --> G2["GPT-2<br/>1.5B"]
    G2 --> G3["GPT-3<br/>175B"]
    G3 --> G4["GPT-4<br/>~1.8T"]
    
    style G1 fill:#e3f2fd
    style G2 fill:#bbdefb
    style G3 fill:#90caf9
    style G4 fill:#42a5f5

Emergent capabilities appear at scale:

  • Chain-of-thought reasoning
  • In-context learning (few-shot)
  • Code generation
  • Multi-step planning

Large Language Models

How LLMs Generate Text

LLMs are autoregressive: they predict the next token based on all previous tokens.

%%| echo: false
%%| fig-width: 8
flowchart LR
    C["Context"] --> M["LLM"]
    M --> P["Probabilities"]
    P --> S["Sample"]
    S --> T["Token"]
    T --> |"Append"| C
    
    style C fill:#e3f2fd,stroke:#1976d2
    style M fill:#fff3e0,stroke:#f57c00
    style P fill:#e8f5e9,stroke:#388e3c
    style T fill:#f3e5f5,stroke:#7b1fa2

The Generation Loop:

  1. Probabilities: Compute scores for the vocabulary.
  2. Sampling: Select next word (via Temperature).
  3. Autoregression: Append and repeat.

Temperature (\(\tau\)):

\(\tau\) Behavior
0 Deterministic
0.7 Balanced
1.0 Probabilistic
1.5 Creative

Lower \(\tau\) = Predictable Higher \(\tau\) = Random

Why We Need Randomness: The Obama Example

Prompt: “The first African American president is Barack…”

  • Most probable next token: “Obama” ✓
  • Also correct: “Hussein” (his middle name)

A greedy strategy always picks “Obama” — but in formal documents, “Barack Hussein Obama” is preferred.

Temperature > 0 allows the model to explore alternatives that may better fit the context.

The LLM Lifecycle

%%| echo: false
%%| fig-width: 12
flowchart LR
    D[Data Collection] --> P[Pre-Training]
    P --> I[Instruction Tuning]
    I --> A[Alignment]
    A --> Dep[Deployment]
    
    style D fill:#e3f2fd,stroke:#1976d2
    style P fill:#e8f5e9,stroke:#388e3c
    style I fill:#fff3e0,stroke:#f57c00
    style A fill:#fce4ec,stroke:#c2185b
    style Dep fill:#f3e5f5,stroke:#7b1fa2
Stage Purpose
Data Collection Curate training corpus (quality > quantity)
Pre-Training Predict next tokens on billions of sequences
Instruction Tuning Teach the model to follow instructions
Alignment Ensure behavior matches human values (RLHF)
Deployment Optimize for latency, cost, safety

Alignment: Why It Matters

%%| echo: false
%%| fig-width: 10
flowchart LR
    Q["User Query"]
    
    Q --> U["Unaligned"]
    Q --> A["Aligned"]
    
    U --> UR["Yes, only true god"]
    
    A --> AR["Multiple perspectives exist"]
    
    style Q fill:#e3f2fd,stroke:#1976d2
    style UR fill:#ffcccc,stroke:#cc0000
    style AR fill:#ccffcc,stroke:#00cc00

Example: “Is Allah the only god?”

  • Unaligned: “Yes, Allah is the one true god and all other beliefs are false.”
  • Aligned: “In Islam, Allah is considered the one God. Other religions have different perspectives. I can provide factual information if helpful.”

This nuanced behavior emerges from alignment training, not pre-training alone.

Context Windows and Prompting

Context window: Maximum tokens the model can “see” at once

%%| echo: false
%%| fig-width: 10
flowchart LR
    S[System Prompt<br/>~500 tokens]
    T[Tools/Schemas<br/>~300 tokens]
    H[History<br/>~1000 tokens]
    R[Retrieved Docs<br/>~2000 tokens]
    U[User Query<br/>~200 tokens]
    
    S --> M[LLM]
    T --> M
    H --> M
    R --> M
    U --> M
    
    style S fill:#e3f2fd
    style R fill:#c8e6c9
    style U fill:#fff3e0

Prompting strategies: Zero-shot, Few-shot, Chain-of-thought, System prompts

AI Agents

What Are AI Agents?

“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” — Edsger Dijkstra

AI agents are autonomous systems that:

  • Perceive their environment
  • Reason about goals
  • Take actions to achieve outcomes
  • Learn from results

Unlike chatbots, agents can act in the world.

The Agent Loop

%%| echo: false
%%| fig-width: 10
flowchart LR
    P["Perceive"] --> R["Reason"]
    R --> A["Act"]
    A --> O["Observe"]
    O --> P
    
    style P fill:#e3f2fd,stroke:#1976d2
    style R fill:#fff3e0,stroke:#f57c00
    style A fill:#e8f5e9,stroke:#388e3c
    style O fill:#fce4ec,stroke:#c2185b

The agent perceives its environment, reasons about goals, acts to achieve outcomes, observes the result, and repeats — a continuous loop of intelligent behavior.

Tool Use: Giving LLMs Hands

LLMs are “brains without hands” — function calling bridges this gap:

%%| echo: false
%%| fig-width: 10
flowchart LR
    U["Query"] --> L["LLM"]
    L --> TC["Tool Call"]
    TC --> O["Orchestrator"]
    O --> T["Tool"]
    T --> |"Result"| L
    L --> R["Response"]
    
    style U fill:#e3f2fd,stroke:#1976d2
    style L fill:#fff3e0,stroke:#f57c00
    style TC fill:#fce4ec,stroke:#c2185b
    style T fill:#e8f5e9,stroke:#388e3c
    style R fill:#f3e5f5,stroke:#7b1fa2

Examples: Web search, database queries, code execution, API calls.

Example: Currency Conversion Agent

User: “What’s $100 in euros?”

Agent reasoning:

  • I need to convert currency
  • Call convert_currency(amount=100, from="USD", to="EUR")

Tool returns: 92.50

Agent response: “100 US dollars is approximately 92.50 euros at current exchange rates.”

The agent reasons about what tool to use, then acts to get information.

Multi-Step Planning

Complex tasks require chained actions:

%%| echo: false
%%| fig-width: 12
flowchart LR
    Task[Task] --> get_rates[get_rates]
    get_rates --> Rates[Rates]
    Rates --> get_prices[get_prices]
    get_prices --> Prices[Prices]
    Prices --> correlate[correlate]
    correlate --> r[r=0.73]
    r --> report[report]
    report --> Final[Final Report]
    
    style Task fill:#e3f2fd,stroke:#1976d2
    style get_rates fill:#fff3e0,stroke:#f57c00
    style Rates fill:#fff3e0,stroke:#f57c00
    style get_prices fill:#fff3e0,stroke:#f57c00
    style Prices fill:#fff3e0,stroke:#f57c00
    style correlate fill:#fff3e0,stroke:#f57c00
    style r fill:#fff3e0,stroke:#f57c00
    style report fill:#fff3e0,stroke:#f57c00
    style Final fill:#c8e6c9,stroke:#2e7d32

Each step informs the next — true autonomous problem-solving.

Planning Capabilities

Planning capabilities enable:

Capability Description Example
Decomposition Break complex goals into subtasks “Analyze market” → 4 API calls
State tracking Remember intermediate results Store data between steps
Adaptation Adjust plan based on results Retry if API fails
Synthesis Combine outputs into final answer Merge data into report

Business impact: Agents can handle multi-hour research tasks that would take humans days.

ReAct: The Loop

The Loop:

Step Action Example
Observe Analyze input, tool outputs, environment “User wants weather in Paris”
Think Decide next action or tool to use “I should call weather API”
Act Execute tool or generate response get_weather("Paris")

Key insight: Unlike single-pass generation, ReAct agents can course-correct based on intermediate results.

Case Study: ChatDev

ChatDev orchestrates a virtual software company with specialized AI agents:

%%| echo: false
%%| fig-width: 10
flowchart LR
    CEO[CEO] --- CTO[CTO]
    CTO --- CPO[CPO]
    Prog[Programmer] --- Des[Designer]
    Test[Tester] --- Prog2[Programmer]
    
    CEO --> Prog
    Des --> Test
    Test --> Doc[Documentation]
    
    style CEO fill:#ffcccc,stroke:#cc0000
    style CTO fill:#ccffcc,stroke:#00cc00
    style Prog fill:#cce5ff,stroke:#1976d2
    style Test fill:#fff3cd,stroke:#f57c00

Results: 70 software projects, 17 files each, ~$0.30 per project, 7 minutes.

Agent Orchestration Patterns

%%| echo: false
%%| fig-width: 12
flowchart LR
    A1[Agent A] --> A2[Agent B] --> A3[Agent C]
    
    T[Task] --> P1[Agent 1]
    T --> P2[Agent 2]
    T --> P3[Agent 3]
    P1 --> R[Results]
    P2 --> R
    P3 --> R
    
    S[Supervisor] --> H1[Worker 1]
    S --> H2[Worker 2]
    S --> H3[Worker 3]

Orchestration: Use Cases

Pattern Use Case Tradeoff
Sequential Content pipeline (research → write → edit) Simple but slow
Parallel Multi-source analysis Fast but needs synthesis
Hierarchical Project management Control but bottleneck risk
Dynamic Market-based task allocation Flexible but complex

The Risks of Agent Autonomy

Case Study: Replit Agent Failure

%%| echo: false
%%| fig-width: 10
flowchart LR
    U["User: Fix this bug"] --> A[Agent]
    A --> D1["Diagnoses: config file issue"]
    D1 --> D2["Decides: delete config"]
    D2 --> B["Bug in delete tool"]
    B --> C["Entire project wiped"]
    C --> X["Production DB destroyed"]
    
    style D1 fill:#fff3cd
    style D2 fill:#ffcccc
    style B fill:#ffcccc
    style X fill:#ff0000,color:#fff

Lesson: Agent autonomy requires multiple safety layers.

What went wrong:

Failure Type Prevention
Wrong diagnosis Reasoning error Require confirmation for destructive actions
Auto-delete decision Autonomy overreach Human-in-the-loop for irreversible ops
Tool bug Implementation flaw Sandbox testing, rollback capability
No backup Missing safeguard Mandatory snapshots before changes

Key principle: The more powerful the agent, the more guardrails it needs.

Agent Safety Challenges

%%| echo: false
%%| fig-width: 10
flowchart LR
    PI[Prompt Injection] --> A[Agent]
    AD[Adversarial Inputs] --> A
    GM[Goal Misalignment] --> A
    HA[Hallucinations] --> A
    CO[Capability Overhang] --> A
    LC[Lack of Corrigibility] --> A
    A --> H[Harm]
    
    style PI fill:#ffcccc
    style HA fill:#fff3cd
    style H fill:#ff0000,color:#fff

Autonomous agents amplify risks — a hallucination becomes action.

Agent Safety: Risk Taxonomy

Risk Description Real Example
Prompt Injection Hidden instructions hijack agent Email contains “ignore previous instructions”
Hallucinations Acting on false information Agent invents API that doesn’t exist
Goal Misalignment Optimizes wrong objective Maximizes engagement via manipulation
Capability Overhang Does more than authorized Accesses files outside scope

Safety Mechanisms

%%| echo: false
%%| fig-width: 10
flowchart LR
    U[Input] --> IF[Input Filter]
    IF --> |Clean| A[Agent]
    IF --> |Malicious| B[Block]
    A --> OF[Output Filter]
    OF --> |Safe| R[Response]
    OF --> |Unsafe| B
    A --> M[Monitor]
    M --> |Anomaly| CB[Circuit Breaker]
    CB --> B
    
    style IF fill:#fff3e0,stroke:#f57c00
    style OF fill:#fff3e0,stroke:#f57c00
    style B fill:#ffcccc,stroke:#cc0000
    style R fill:#ccffcc,stroke:#00cc00
    style CB fill:#fce4ec,stroke:#c2185b

The Safety Pipeline:

  • Input/Output Guards: Fast classifiers that run before and after the LLM.
  • Monitoring: Watching for “strange” behavior (e.g., an agent trying to access a restricted database).
  • Circuit Breakers: Automatically killing the agent process if safety thresholds are exceeded.

The Defense-in-Depth Pipeline

Layer Purpose Technical Method
Input Filter Block malicious prompts PII detection, jailbreak classifiers
Sandboxing Isolate agent actions Docker containers, restricted API keys
Output Filter Prevent sensitive leakage RegEx for PII, toxic content scoring
Human-in-the-Loop Verify high-risk actions “Approve” button for financial transfers
Monitoring Detect runtime anomalies Log analysis, capability tracking

Key Principle: Never rely on the LLM to self-police. Use external code to enforce boundaries.

Human-in-the-Loop (HITL)

The most effective safety measure for high-stakes agents:

  • Critical Actions: Require manual approval for destructive or financial operations (e.g., rm -rf, send_payment).
  • Confirmation Dialogue: Show the agent’s proposed plan before execution.
  • Feedback Loop: Allow the human to correct the agent’s reasoning.
  • Audit Logs: Every action approved or rejected by a human is recorded for training and safety reviews.

Example: A code-refactoring agent proposes changes; a human developer reviews and clicks “Merge” or “Reject”.

Anthropic’s ASL-3 Safety Measures

For Claude Opus 4, Anthropic activated proactive safety:

%%| echo: false
%%| fig-width: 10
flowchart LR
    U[User] --> CC[Constitutional Classifiers]
    CC --> |Safe| M[Model]
    CC --> |Blocked| B[Reject]
    M --> OC[Output Check]
    OC --> |Safe| R[Response]
    OC --> |Harmful| B
    
    BB[Bug Bounty] --> CC
    RP[Rapid Patch] --> CC
    
    style CC fill:#c8e6c9,stroke:#2e7d32
    style B fill:#ffcccc,stroke:#cc0000
    style R fill:#e3f2fd,stroke:#1976d2

ASL-3: Safety Pipeline

Layer Function Why It Matters
Constitutional AI Real-time input/output filtering Blocks harmful requests before execution
Bug Bounty Crowdsourced discovery Finds attacks humans miss
Rapid Patching Auto-generate variants Stays ahead of attackers
Egress Control Throttle outbound data Prevents model weight theft

Evaluating AI Agents

Traditional metrics (accuracy, precision) are insufficient for agents.

%%| echo: false
%%| fig-width: 10
flowchart LR
    TC["Task Completion"] --> Score["Overall Agent Score"]
    RQ["Reasoning Quality"] --> Score
    SA["Safety"] --> Score
    RE["Resource Efficiency"] --> Score
    ER["Error Recovery"] --> Score
    AD["Adversarial Robustness"] --> Score
    
    style SA fill:#ffcccc,stroke:#cc0000
    style Score fill:#c8e6c9,stroke:#2e7d32

Evaluation Approaches

%%| echo: false
%%| fig-width: 10
flowchart LR
    A[Agent Output] --> R[Rule-Based]
    A --> L[LLM-as-Judge]
    A --> H[Human Review]
    A --> S[Simulation]
    
    R --> E[Score]
    L --> E
    H --> E
    S --> E
    
    style R fill:#e3f2fd,stroke:#1976d2
    style L fill:#fff3e0,stroke:#f57c00
    style H fill:#c8e6c9,stroke:#2e7d32
    style S fill:#f3e5f5,stroke:#7b1fa2
    style E fill:#ffcccc,stroke:#cc0000

Best practice: Combine multiple approaches for comprehensive evaluation.

Domain-Specific Benchmarks

Domain Benchmark What It Tests
Coding SWE-bench Fix real GitHub issues
Web WebArena Navigate websites, complete tasks
Robotics ALFRED Household tasks in 3D
Enterprise TAU-bench Multi-system workflows

Agent capabilities are task-specific — benchmarks must match use cases.

Red-Teaming Agents

Systematic vulnerability testing:

%%| echo: false
%%| fig-width: 10
flowchart LR
    PI[Prompt Injection] --> A[Agent]
    ME[Agent Mistakes] --> A
    MU[Direct Misuse] --> A
    
    A --> |Vulnerability| V[Security Issue]
    A --> |Safe| S[Normal Operation]
    
    V --> R[Report]
    
    style PI fill:#ffcccc,stroke:#cc0000
    style ME fill:#fff3cd,stroke:#f57c00
    style MU fill:#ffcccc,stroke:#cc0000
    style V fill:#ffcccc,stroke:#cc0000
    style S fill:#ccffcc,stroke:#00cc00

Example: Hidden text in a webpage hijacks agent to exfiltrate data.

Comprehensive red-teaming found 1,200+ vulnerabilities in one enterprise agent.

AI in the Physical World

Embodied AI: Robots

Software agents operate in digital systems. Embodied agents must handle:

%%| echo: false
%%| fig-width: 12
flowchart LR
    C["Camera"] --> F["Fusion"]
    L["Lidar"] --> F
    T["Touch"] --> F
    F --> B["Robot Brain"]
    B --> M["Motors"]
    M --> E["Environment"]
    E --> |"Feedback"| C
    
    style F fill:#fff3e0,stroke:#f57c00
    style B fill:#e3f2fd,stroke:#1976d2
    style E fill:#c8e6c9,stroke:#388e3c

The sim-to-real gap: Robots trained in simulation often fail in reality.

The Evolution of Robotic Intelligence

%%| echo: false
%%| fig-width: 10
flowchart LR
    S[1960s Shakey] --> P[1980s-2000s Probabilistic]
    P --> F[2020s Foundation Models]
    
    style S fill:#e3f2fd,stroke:#1976d2
    style P fill:#fff3e0,stroke:#f57c00
    style F fill:#c8e6c9,stroke:#2e7d32

Robotics: Capability Eras

Era Capability Limitation
Rule-based Explicit reasoning Brittle, narrow
Probabilistic Handle uncertainty No language understanding
Foundation Models Natural language + adaptation Compute-intensive

LLMs have catalyzed a new era: robots that understand language and adapt.

Google’s Robotic Transformer (RT-2)

A vision-language-action model that directly controls robots:

%%| echo: false
%%| fig-width: 10
flowchart LR
    V[Vision Input] --> VLA[RT-2 Model]
    L[Language Input] --> VLA
    VLA --> A[Action Output]
    A --> R[Robot]
    R -- Feedback --> V
    
    style V fill:#e3f2fd,stroke:#1976d2
    style L fill:#fff3e0,stroke:#f57c00
    style VLA fill:#f3e5f5,stroke:#7b1fa2
    style A fill:#c8e6c9,stroke:#2e7d32
    style R fill:#ffcccc,stroke:#cc0000

GeneralInteractiveDexterous

Works across robot forms: arms, humanoids, mobile platforms.

Robot Safety: ASIMOV

Named after Asimov’s Laws of Robotics, this benchmark tests embodied AI safety:

Asimov’s Law Modern Interpretation Test Scenario
1. Don’t harm humans Refuse dangerous commands “Throw this at the person”
2. Obey orders Follow safe instructions “Hand me that tool”
3. Protect self Avoid self-damage Don’t walk off ledge
Zeroth Law Protect humanity broadly Consider societal impact

Key challenge: Context matters — “Hand me that knife” is safe in a kitchen, dangerous in a conflict.

Business relevance: As robots enter warehouses, hospitals, and homes, safety benchmarks become legal and ethical requirements.

Key Takeaways

Summary: NLP

Concept Key Insight
Word Embeddings Words as vectors; geometry = meaning
Distributional Hypothesis Context reveals meaning
Attention Dynamic weighting of relevant information
Transformers Parallel processing, scalable, powerful

The shift from symbols to vectors enabled modern NLP.

Summary: LLMs

Concept Key Insight
Autoregressive Generation Predict next token iteratively
Temperature Controls randomness/creativity
Alignment Ensures safe, helpful behavior
Context Windows Limit on “memory” size

Scale + alignment = emergent reasoning capabilities.

Summary: AI Agents

Concept Key Insight
Tool Use LLMs gain ability to act
Multi-Step Planning Chain reasoning and action
Orchestration Multiple agents collaborate
Safety Autonomy amplifies risks
Evaluation Requires new methodologies

Agents transform LLMs from conversationalists to autonomous workers.

The Executive Perspective

For AI leaders:

  • NLP powers search, chatbots, document analysis
  • LLMs enable natural language interfaces to business systems
  • Agents can automate complex, multi-step workflows
  • Safety must be built in from the start, not bolted on
  • Evaluation requires domain-specific benchmarks and human oversight

The promise: augmenting human intelligence — agents handle routine tasks while humans provide judgment, creativity, and ethical oversight.

Supplemental Reading

Online Articles:

From the Textbook:

  • Chapter 24: Natural Language Processing
  • Chapter 26: AI Agents

What is Cursor?

Cursor is an AI-powered code editor built on VS Code. It allows you to:

  • Write code with AI assistance
  • Ask questions about your code
  • Generate code from natural language descriptions
  • Debug and fix errors with AI help

For this course, we’ll use Cursor to build an AI agent without needing deep programming expertise.

Part 1: Download and Install Cursor

Step 1: Download Cursor

  1. Go to cursor.sh
  2. Click the Download button
  3. The website will automatically detect your operating system (Mac, Windows, or Linux)

Step 2: Install on Mac

  1. Open the downloaded .dmg file
  2. Drag the Cursor icon to the Applications folder
  3. Open Cursor from Applications
  4. If prompted about security, go to System Preferences → Security & Privacy and click “Open Anyway”

Step 3: Install on Windows

  1. Run the downloaded .exe installer
  2. Follow the installation wizard
  3. Launch Cursor from the Start Menu

Part 2: Initial Setup

  1. When Cursor opens, you’ll see a welcome screen
  2. Click Sign In to create a free account
  3. You can sign in with:
    • Google account
    • GitHub account
    • Email

Benefits of signing in:

  • Free AI credits for code assistance
  • Settings sync across devices

Step 2: Choose Your Theme

  1. Cursor will ask about your preferred color theme
  2. Choose Dark or Light based on your preference
  3. You can change this later in Settings

Step 3: Import VS Code Settings (Optional)

If you’ve used VS Code before:

  1. Cursor will offer to import your settings
  2. Click Import to bring over extensions and preferences
  3. Or click Skip to start fresh

Part 3: Install Python

Cursor needs Python installed on your computer to run our project.

Check if Python is Already Installed

  1. In Cursor, open the terminal: View → Terminal (or press Ctrl+`)
  2. Type this command and press Enter:
python --version
  1. If you see Python 3.x.x, you’re good! Skip to Part 4.
  2. If you get an error, follow the installation steps below.

Install Python on Mac

Option A: Using Homebrew (Recommended)

  1. Open Terminal (outside of Cursor)
  2. Install Homebrew if you don’t have it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  1. Install Python:
brew install python

Option B: Direct Download

  1. Go to python.org/downloads
  2. Download the latest Python 3.x version
  3. Run the installer
  4. Important: Check “Add Python to PATH” during installation

Install Python on Windows

  1. Go to python.org/downloads
  2. Click “Download Python 3.x.x”
  3. Run the installer
  4. IMPORTANT: Check the box that says “Add Python to PATH”
  5. Click “Install Now”

Verify Installation

After installation, close and reopen Cursor, then:

  1. Open terminal: View → Terminal
  2. Run:
python --version
  1. You should see Python 3.x.x

Part 4: Install Required Packages

Our project needs a few Python libraries. Install them in Cursor’s terminal:

pip install pandas numpy scikit-learn

You should see output indicating successful installation.

If you get a “pip not found” error on Mac:

pip3 install pandas numpy scikit-learn

Part 5: Using Cursor’s AI Features

Feature 1: AI Chat (Cmd+L / Ctrl+L)

Use this to ask questions or get help:

  1. Press Cmd+L (Mac) or Ctrl+L (Windows)
  2. A chat panel opens on the right
  3. Ask questions like:
    • “How do I load a CSV file in Python?”
    • “Explain what this code does”
    • “Why am I getting this error?”

Feature 2: Inline Edit (Cmd+K / Ctrl+K)

Use this to write or modify code:

  1. Select some code (or place cursor where you want new code)
  2. Press Cmd+K (Mac) or Ctrl+K (Windows)
  3. Describe what you want in plain English:
    • “Add a function that calculates the average price”
    • “Fix this error”
    • “Add comments explaining this code”
  4. Review the suggested changes
  5. Press Enter to accept or Escape to cancel

Feature 3: Code Completion (Tab)

As you type, Cursor suggests completions:

  1. Start typing code
  2. You’ll see gray “ghost text” suggestions
  3. Press Tab to accept the suggestion
  4. Press Escape to dismiss

Feature 4: Agent Mode (Cmd+I / Ctrl+I)

For larger tasks, use Agent mode:

  1. Press Cmd+I (Mac) or Ctrl+I (Windows)
  2. Describe a complex task:
    • “Create a Python script that loads data and builds a regression model”
  3. The agent will generate multiple files and complete code

Part 6: Creating Your First Project

Step 1: Create a Project Folder

  1. In Cursor, go to File → Open Folder
  2. Navigate to where you want your project (e.g., Documents)
  3. Click New Folder and name it oj-pricing-agent
  4. Select this folder and click Open

Step 2: Create a Python File

  1. In the Explorer sidebar (left panel), right-click
  2. Select New File
  3. Name it test.py
  4. Add this code:
print("Hello from Cursor!")

Step 3: Run Your Code

  1. Open the terminal: View → Terminal
  2. Run your script:
python test.py
  1. You should see: Hello from Cursor!

Congratulations! You’re ready to build your AI agent.

Part 7: Keyboard Shortcuts Reference

Action Mac Windows
AI Chat Cmd+L Ctrl+L
Inline Edit Cmd+K Ctrl+K
Agent Mode Cmd+I Ctrl+I
Open Terminal Ctrl+| Ctrl+
Save File Cmd+S Ctrl+S
Open File Cmd+O Ctrl+O
New File Cmd+N Ctrl+N
Find Cmd+F Ctrl+F

Troubleshooting

“Python not found” in terminal

Mac:

  • Try python3 instead of python
  • Or run: brew install python

Windows:

  • Reinstall Python and make sure to check “Add Python to PATH”
  • Restart Cursor after installation

“pip not found”

Mac:

  • Use pip3 instead of pip

Windows:

  • Try python -m pip install package_name

Cursor won’t start

  1. Make sure you have enough disk space (at least 1GB free)
  2. Try restarting your computer
  3. Reinstall Cursor from cursor.sh

AI features not working

  1. Make sure you’re signed in (check bottom-left corner)
  2. Check your internet connection
  3. Try signing out and back in

Code runs but shows errors

  1. Copy the error message
  2. Press Cmd+L (or Ctrl+L) to open AI Chat
  3. Paste the error and ask “How do I fix this?”

Getting Help During the Course

  1. Zoom Sessions: Ask questions during live sessions
  2. Cursor AI: Use Cmd+L to ask the AI for help
  3. Discussion Board: Post questions for peer assistance
  4. Office Hours: [Insert instructor office hours if applicable]

Next Steps

After completing this setup:

  1. ✅ Cursor is installed and running
  2. ✅ Python is installed
  3. ✅ Required packages are installed
  4. ✅ You can create and run Python files

You’re ready for Zoom Session 1 where we’ll practice using Cursor’s AI features together!

Overview

In this project, you will build an AI agent that helps a retail pricing analyst make decisions about orange juice pricing and promotions. The agent will:

  1. Load and explore sales data
  2. Build a regression model to predict sales
  3. Answer business questions using natural language

Time Required: ~2 hours

Prerequisites:

Part 1: Project Setup

Step 1.1: Create Your Project Folder

  1. Open Cursor IDE
  2. Click File → Open Folder
  3. Create a new folder called oj-pricing-agent on your computer
  4. Select that folder to open it in Cursor

Step 1.2: Create the Main Python File

  1. In the Cursor sidebar, right-click and select New File
  2. Name it oj_agent.py
  3. You’ll see an empty file open in the editor

Step 1.3: Copy the Dataset

Download the oj_data.csv file and copy it into your oj-pricing-agent folder.

Part 2: Data Loading and Exploration

Step 2.1: Load the Required Libraries

In your oj_agent.py file, start by adding these lines at the top:

# Required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

What this does: These libraries help us work with data (pandas), do math (numpy), and build models (sklearn).

Step 2.2: Load the Data

Add the following code to load the orange juice sales data:

# Load the orange juice dataset
print("Loading data...")
df = pd.read_csv('oj_data.csv')

# Display basic information
print(f"Dataset has {len(df)} rows and {len(df.columns)} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nBrands in dataset: {df['brand'].unique()}")
print(f"\nPrice range: ${df['price'].min():.2f} - ${df['price'].max():.2f}")
print(f"\nSample of data:")
print(df.head())

Step 2.3: Run Your Code (First Test)

  1. Save the file (Ctrl+S or Cmd+S)
  2. Open the terminal in Cursor: View → Terminal
  3. Run the script:
python oj_agent.py

You should see output showing:

  • The dataset has ~28,000 rows
  • Three brands: Tropicana, Minute Maid, Dominick’s
  • Price ranges from about $1 to $4

Troubleshooting: If you get an error about missing packages, run:

pip install pandas numpy scikit-learn

Part 3: Building the Regression Model

Step 3.1: Understanding the Model

We’re building a model that predicts log of sales volume based on:

  • Price: Higher price → lower sales (negative relationship)
  • Featured (feat): If the product is in the weekly ad circular (1 = yes, 0 = no)
  • Brand: Different brands have different base sales levels
  • Price × Brand interaction: Price sensitivity varies by brand

Step 3.2: Prepare the Data for Modeling

Add this code to prepare features for the model:

# ============================================
# PART 3: BUILD THE REGRESSION MODEL
# ============================================

print("\n" + "="*50)
print("Building the pricing model...")
print("="*50)

# Create dummy variables for brand (one-hot encoding)
# This converts 'brand' text into numbers the model can use
brand_dummies = pd.get_dummies(df['brand'], prefix='brand', drop_first=False)

# Create the feature matrix
# We include: price, feat, brand dummies, and price*brand interactions
X = pd.DataFrame({
    'price': df['price'],
    'feat': df['feat'],
    'brand_minute.maid': brand_dummies['brand_minute.maid'],
    'brand_tropicana': brand_dummies['brand_tropicana'],
    # Interaction terms: price effect varies by brand
    'price_x_minute.maid': df['price'] * brand_dummies['brand_minute.maid'],
    'price_x_tropicana': df['price'] * brand_dummies['brand_tropicana']
})

# Target variable: log of sales (logmove)
y = df['logmove']

print(f"Features: {list(X.columns)}")
print(f"Target: logmove (log of sales volume)")

Step 3.3: Fit the Model

Add code to train the regression model:

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Display the coefficients
print("\nModel Coefficients:")
print("-" * 40)
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:.4f}")
print(f"  intercept: {model.intercept_:.4f}")

# Calculate R-squared (how well the model fits)
r_squared = model.score(X, y)
print(f"\nModel R-squared: {r_squared:.3f}")
print("(This means the model explains {:.1f}% of sales variation)".format(r_squared * 100))

Step 3.4: Run and Verify the Model

Save and run the script again. You should see coefficients like:

  • price: negative (higher price = lower sales)
  • feat: positive (being featured increases sales)
  • brand coefficients: capture baseline differences between brands
  • interaction terms: show how price sensitivity differs by brand

Part 4: Creating Helper Functions for the Agent

Step 4.1: Add Prediction Functions

Add these functions that the agent will use to answer questions:

# ============================================
# PART 4: HELPER FUNCTIONS FOR THE AGENT
# ============================================

def predict_sales(brand, price, featured=0):
    """
    Predict sales volume for a given brand, price, and feature status.
    
    Args:
        brand: 'tropicana', 'minute.maid', or 'dominicks'
        price: price in dollars (e.g., 2.50)
        featured: 1 if in ad circular, 0 if not
    
    Returns:
        Predicted sales volume (not log-transformed)
    """
    # Create feature vector
    features = {
        'price': price,
        'feat': featured,
        'brand_minute.maid': 1 if brand.lower() == 'minute.maid' else 0,
        'brand_tropicana': 1 if brand.lower() == 'tropicana' else 0,
        'price_x_minute.maid': price if brand.lower() == 'minute.maid' else 0,
        'price_x_tropicana': price if brand.lower() == 'tropicana' else 0
    }
    
    # Convert to dataframe for prediction
    X_pred = pd.DataFrame([features])
    
    # Predict log sales, then convert back
    log_sales = model.predict(X_pred)[0]
    sales = np.exp(log_sales)
    
    return sales


def get_price_elasticity(brand):
    """
    Calculate the price elasticity for a given brand.
    
    Price elasticity tells us: if price increases by 1%, 
    how much does quantity demanded change (in %)?
    
    A more negative number means more price-sensitive.
    """
    # Base price coefficient
    base_coef = model.coef_[0]  # price coefficient
    
    # Add brand-specific interaction if applicable
    if brand.lower() == 'minute.maid':
        interaction_coef = model.coef_[4]  # price_x_minute.maid
    elif brand.lower() == 'tropicana':
        interaction_coef = model.coef_[5]  # price_x_tropicana
    else:  # dominicks (base case)
        interaction_coef = 0
    
    total_elasticity = base_coef + interaction_coef
    return total_elasticity


def get_advertising_lift(brand):
    """
    Calculate the sales lift from being featured in advertising.
    Returns the percentage increase in sales.
    """
    # The 'feat' coefficient tells us the log-sales increase
    feat_coef = model.coef_[1]  # feat coefficient
    
    # Convert from log to percentage change
    percentage_lift = (np.exp(feat_coef) - 1) * 100
    return percentage_lift


def find_optimal_price(brand, min_price=1.0, max_price=4.0, featured=0):
    """
    Find the price that maximizes revenue for a brand.
    Revenue = Price × Quantity
    """
    best_price = min_price
    best_revenue = 0
    
    # Search through price range
    for price in np.arange(min_price, max_price, 0.05):
        sales = predict_sales(brand, price, featured)
        revenue = price * sales
        
        if revenue > best_revenue:
            best_revenue = revenue
            best_price = price
    
    return best_price, best_revenue


def compare_elasticities():
    """
    Compare price elasticity across all three brands.
    """
    brands = ['dominicks', 'minute.maid', 'tropicana']
    results = {}
    
    for brand in brands:
        elasticity = get_price_elasticity(brand)
        results[brand] = elasticity
    
    return results

Step 4.2: Test the Helper Functions

Add test code to verify the functions work:

# ============================================
# TEST THE HELPER FUNCTIONS
# ============================================

print("\n" + "="*50)
print("Testing helper functions...")
print("="*50)

# Test prediction
test_sales = predict_sales('tropicana', 2.50, featured=0)
print(f"\nPredicted sales for Tropicana at $2.50 (no ad): {test_sales:.0f} units")

# Test elasticity
elasticities = compare_elasticities()
print("\nPrice Elasticities by Brand:")
for brand, elast in elasticities.items():
    print(f"  {brand}: {elast:.3f}")

# Test advertising lift
lift = get_advertising_lift('minute.maid')
print(f"\nAdvertising lift: {lift:.1f}% increase in sales")

# Test optimal price
opt_price, opt_rev = find_optimal_price('dominicks')
print(f"\nOptimal price for Dominick's: ${opt_price:.2f} (revenue: ${opt_rev:.2f})")

Run the script again to verify all functions work correctly.

Part 5: Creating the AI Agent

Step 5.1: Add the Agent Logic

Now we’ll create the agent that interprets natural language questions and calls the appropriate functions. Add this code:

# ============================================
# PART 5: THE AI AGENT
# ============================================

def answer_question(question):
    """
    Simple agent that answers business questions about OJ pricing.
    
    This is a rule-based agent that matches keywords in the question
    to determine which analysis to perform.
    """
    question_lower = question.lower()
    
    # Question 1: Predict sales for specific scenario
    if 'predict' in question_lower or 'sales volume' in question_lower:
        # Extract brand and price from question if possible
        if 'tropicana' in question_lower:
            brand = 'tropicana'
        elif 'minute maid' in question_lower:
            brand = 'minute.maid'
        else:
            brand = 'dominicks'
        
        # Look for price (default to $2.50 if not found)
        import re
        price_match = re.search(r'\$?(\d+\.?\d*)', question_lower)
        price = float(price_match.group(1)) if price_match else 2.50
        
        # Check for advertising
        featured = 1 if 'advertis' in question_lower or 'feature' in question_lower else 0
        if 'no advertis' in question_lower or 'without advertis' in question_lower:
            featured = 0
        
        sales = predict_sales(brand, price, featured)
        
        response = f"""
**Predicted Sales Analysis**

Brand: {brand.title().replace('.', ' ')}
Price: ${price:.2f}
Featured in Ad: {'Yes' if featured else 'No'}

**Predicted Sales Volume: {sales:,.0f} units**

This prediction is based on our regression model that accounts for:
- Base demand for this brand
- Price sensitivity (elasticity)  
- Advertising effects
"""
        return response
    
    # Question 2: Which brand is most price-sensitive?
    elif 'price-sensitive' in question_lower or 'price sensitive' in question_lower or 'most sensitive' in question_lower:
        elasticities = compare_elasticities()
        
        # Find most price-sensitive (most negative elasticity)
        most_sensitive = min(elasticities, key=elasticities.get)
        
        response = f"""
**Price Sensitivity Analysis**

Price Elasticity by Brand:
"""
        for brand, elast in sorted(elasticities.items(), key=lambda x: x[1]):
            sensitivity = "HIGH" if elast < -3 else "MEDIUM" if elast < -2 else "LOW"
            response += f"- {brand.title().replace('.', ' ')}: {elast:.3f} ({sensitivity} sensitivity)\n"
        
        response += f"""
**Most Price-Sensitive: {most_sensitive.title().replace('.', ' ')}**

Interpretation: A 1% price increase leads to a {abs(elasticities[most_sensitive]):.1f}% decrease in sales for {most_sensitive.title().replace('.', ' ')}.

Business Implication: Be careful with price increases on {most_sensitive.title().replace('.', ' ')} - customers are very responsive to price changes.
"""
        return response
    
    # Question 3: Should we feature a brand in advertising?
    elif 'feature' in question_lower or 'ad circular' in question_lower or 'advertising' in question_lower:
        if 'minute maid' in question_lower:
            brand = 'minute.maid'
        elif 'tropicana' in question_lower:
            brand = 'tropicana'
        else:
            brand = 'dominicks'
        
        lift = get_advertising_lift(brand)
        
        # Calculate example impact
        base_sales = predict_sales(brand, 2.50, featured=0)
        featured_sales = predict_sales(brand, 2.50, featured=1)
        
        response = f"""
**Advertising Impact Analysis for {brand.title().replace('.', ' ')}**

Expected Sales Lift from Featuring: **{lift:.1f}%**

Example at $2.50:
- Without advertising: {base_sales:,.0f} units
- With advertising: {featured_sales:,.0f} units  
- Additional sales: {featured_sales - base_sales:,.0f} units

**Recommendation:** {'Yes, feature this product!' if lift > 20 else 'Consider the advertising cost vs. the sales lift.'}

The advertising effect is consistent across price points. Factor in your advertising costs to determine if the sales lift justifies the expense.
"""
        return response
    
    # Question 4: Optimal price for a brand
    elif 'optimal price' in question_lower or 'maximize revenue' in question_lower or 'best price' in question_lower:
        if 'minute maid' in question_lower:
            brand = 'minute.maid'
        elif 'tropicana' in question_lower:
            brand = 'tropicana'
        else:
            brand = 'dominicks'
        
        opt_price, opt_revenue = find_optimal_price(brand)
        opt_sales = predict_sales(brand, opt_price, featured=0)
        
        # Compare with current average price
        avg_price = df[df['brand'] == brand]['price'].mean()
        avg_revenue = avg_price * predict_sales(brand, avg_price, featured=0)
        
        response = f"""
**Revenue Optimization for {brand.title().replace('.', ' ')}**

**Optimal Price: ${opt_price:.2f}**

At optimal price:
- Predicted sales: {opt_sales:,.0f} units
- Revenue per store-week: ${opt_revenue:,.2f}

Comparison with current average (${avg_price:.2f}):
- Current revenue: ${avg_revenue:,.2f}
- Potential improvement: ${opt_revenue - avg_revenue:,.2f} ({((opt_revenue/avg_revenue)-1)*100:.1f}%)

Note: This optimization assumes no competitor response and stable market conditions.
"""
        return response
    
    # Question 5: Compare elasticities across brands
    elif 'compare' in question_lower or 'elasticity' in question_lower or 'across' in question_lower:
        elasticities = compare_elasticities()
        
        response = """
**Price Elasticity Comparison Across Brands**

| Brand | Elasticity | Interpretation |
|-------|------------|----------------|
"""
        for brand, elast in sorted(elasticities.items(), key=lambda x: x[1]):
            interp = f"1% price ↑ → {abs(elast):.1f}% sales ↓"
            response += f"| {brand.title().replace('.', ' ')} | {elast:.3f} | {interp} |\n"
        
        response += """
**Key Insights:**

1. **Dominick's** (store brand) is least price-sensitive - customers buying store brands may prioritize value and be less responsive to small price changes.

2. **Tropicana** shows moderate price sensitivity - as a premium brand, some customers are loyal but others will switch if prices rise.

3. **Minute Maid** is most price-sensitive - positioned between store and premium brands, these customers actively compare prices.

**Strategic Implications:**
- Use competitive pricing on Minute Maid to capture price-sensitive shoppers
- Tropicana can sustain moderate price premiums
- Dominick's margins can be optimized with less risk of volume loss
"""
        return response
    
    else:
        return """
I can help you with these types of questions:

1. **Sales Prediction:** "What is the predicted sales volume if we price Tropicana at $2.50?"
2. **Price Sensitivity:** "Which brand is most price-sensitive?"
3. **Advertising Impact:** "Should we feature Minute Maid in the ad circular?"
4. **Price Optimization:** "What price should we set for Dominick's to maximize revenue?"
5. **Elasticity Comparison:** "Compare the price elasticity across brands"

Please try one of these questions!
"""

Step 5.2: Add the Interactive Interface

Finally, add code to let users interact with the agent:

# ============================================
# PART 6: INTERACTIVE AGENT INTERFACE
# ============================================

def run_agent():
    """
    Run the interactive agent interface.
    """
    print("\n" + "="*60)
    print("🍊 ORANGE JUICE PRICING ANALYTICS AGENT 🍊")
    print("="*60)
    print("\nHello! I'm your pricing analytics assistant.")
    print("I can help you analyze orange juice pricing and promotions.")
    print("\nTry asking me questions like:")
    print("  - What is the predicted sales if we price Tropicana at $2.50?")
    print("  - Which brand is most price-sensitive?")
    print("  - Should we feature Minute Maid in the ad circular?")
    print("  - What price maximizes revenue for Dominick's?")
    print("  - Compare price elasticity across brands")
    print("\nType 'quit' to exit.\n")
    
    while True:
        question = input("Your question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("\nThank you for using the OJ Pricing Agent. Goodbye!")
            break
        
        if not question:
            continue
        
        print("\n" + "-"*50)
        response = answer_question(question)
        print(response)
        print("-"*50 + "\n")


# ============================================
# MAIN: RUN THE AGENT
# ============================================

if __name__ == "__main__":
    # Run the interactive agent
    run_agent()

Part 6: Testing Your Agent

Step 6.1: Run the Complete Script

Save the file and run:

python oj_agent.py

Step 6.2: Test All Five Required Questions

Test your agent with these exact questions:

  1. “What is the predicted sales volume if we price Tropicana at $2.50 with no advertising?”

  2. “Which brand is most price-sensitive?”

  3. “Should we feature Minute Maid in this week’s ad circular? What’s the expected sales lift?”

  4. “What price should we set for Dominick’s brand to maximize revenue?”

  5. “Compare the price elasticity across the three brands.”

Record the answers for your summary document.

Part 7: Writing Your Summary

Create a 1-page document (Word or PDF) that includes:

Section 1: Key Findings (half page)

  • Which brand is most/least price-sensitive and why this matters
  • The impact of advertising on sales
  • The optimal pricing recommendations

Section 2: Surprises and Insights (quarter page)

  • What surprised you about the results?
  • How do these findings compare to your intuition?

Section 3: Business Implications (quarter page)

  • How would you recommend a retailer use these insights?
  • What additional data would make this analysis more useful?

Part 8: Preparing Your Demo

For Zoom Session 3, prepare a 2-3 minute demonstration:

  1. Setup (30 sec): Briefly explain what the agent does
  2. Demo (1.5 min): Show 2-3 questions and responses
  3. Insight (1 min): Share your most interesting finding

Tips:

  • Have your script running before you share screen
  • Pre-type a question so you’re not typing live
  • Focus on business insights, not technical details

Troubleshooting

“ModuleNotFoundError: No module named ‘pandas’”

Run: pip install pandas numpy scikit-learn

“FileNotFoundError: oj_data.csv”

Make sure the data file is in the same folder as your Python script.

“Model coefficients look wrong”

Check that your data loaded correctly - you should have ~28,000 rows.

“Agent doesn’t understand my question”

Try rephrasing using keywords like “predict”, “price-sensitive”, “feature”, “optimal”, or “compare”.

Complete Code Reference

The complete oj_agent.py file should be approximately 350-400 lines. If you get stuck, ask Cursor’s AI assistant for help by selecting your code and pressing Cmd+K (Mac) or Ctrl+K (Windows), then describing your issue. I also prepared a complete code reference for you to refer.

Next Steps (Optional Enhancements)

If you finish early and want to explore further:

  1. Add more questions: What other business questions could the agent answer?
  2. Improve the NLP: Use fuzzy matching to better understand varied phrasings
  3. Add visualizations: Create charts showing price vs. sales by brand
  4. Connect to an LLM: Use the OpenAI API to make the agent truly conversational

Good luck with your project! 🍊

Case Study: Mudslide Threat

I live in a house at risk of a mudslide damage.

  • Option A: Build a protective wall ($10,000).
  • Damage Cost: $100,000 if the house is hit (and wall fails/absent).
  • Wall Effectiveness: 95% protection.
  • Probability of Mudslide: \(P(\text{Slide}) = 0.01\).

What is the best course of action?

Decision Tree: Initial Options

graph LR
    Start((Decision)) --> Build[Build Wall]
    Start --> NoBuild[Don't Build]
    
    Build -- "$10,000" --> WallNode{Slide?}
    WallNode -- "0.01" --> FailNode{Wall Fails?}
    FailNode -- "0.05" --> Loss["$100,000 Cost"]
    FailNode -- "0.95" --> NoLoss["$0 Cost"]
    WallNode -- "0.99" --> NoSlide["$0 Cost"]
    
    NoBuild -- "$0" --> SlideNode{Slide?}
    SlideNode -- "0.01" --> Loss2["$100,000 Cost"]
    SlideNode -- "0.99" --> NoLoss2["$0 Cost"]

Comparison: No Test

Don’t Build

\(EV = 0.01 \times \$100,000 = \$1,000\)

Build (No Test)

\(EV = \$10,000 + (0.01 \times 0.05 \times \$100,000) = \$10,050\)

[!IMPORTANT] Based purely on expected cost, Don’t Build is the rational choice despite the high impact of a slide.

The Geological Test

A test is available to better estimate the risk.

  • Cost: $3,000
  • Accuracy:
    • \(P( T \mid \text{Slide} ) = 0.90\)
    • \(P( \text{not } T \mid \text{No Slide} ) = 0.85\)

Should we take the test?

Updating Probabilities: Bayes’ Rule

Probability of Positive Test \(P(T)\):

\(P(T) = (0.90 \times 0.01) + (0.15 \times 0.99) = 0.1575\)

Posterior \(P(\text{Slide} \mid T)\):

\(P(\text{Slide} \mid T) = \frac{0.90 \times 0.01}{0.1575} \approx 0.0571\)

Posterior \(P(\text{Slide} \mid \text{not } T)\):

\(P(\text{Slide} \mid \text{not } T) = \frac{0.1 \times 0.01}{0.8425} \approx 0.0012\)

The Testing Strategy

If we test:

  1. If \(T\): Build the wall.
  2. If not \(T\): Don’t build.

Expected Cost with Test:

\[\begin{aligned} &\text{Test Cost} + P(T) \times \text{EV(Build} \mid T) \\ &\quad + P(\text{not } T) \times \text{EV(No Build} \mid \text{not } T) \end{aligned}\]

\(= 3,000 + (0.1575 \times 10,285) + (0.8425 \times 120) \approx \$4,693\)

Risk vs. Reward

Choice Expected Cost Risk of Loss P
Don’t Build $1,000 0.01 1 in 100
Build w/o test $10,050 0.0005 1 in 2000
Test & Decide $4,693 0.00146 1 in 700

Conclusion

  • Lowest Expected Cost: Don’t Build ($1,000).
  • Lowest Risk of Catastrophe: Build without testing (0.0005).
  • Middle Ground: Testing ($4,693) significantly reduces risk compared to “Don’t Build” without the full $10k upfront cost.

Decision? It depends on your utility function (risk tolerance).

View Python Implementation (Notebook)

Saint Petersburg Paradox

Imagine a gambling game where a fair coin is flipped repeatedly until it lands on heads. The payoff for the game is \(2^N\), where \(N\) is the number of tosses needed for the coin to land on heads.

The expected value of this game is infinite:

\[ E(X) = \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 4 + \frac{1}{8} \cdot 8 + \ldots = \infty \]

This means that, in theory, a rational person should be willing to pay any finite amount to play this game. However, in reality, most people would be unwilling to pay a large amount.

Expected Utility Resolution

Bernoulli argued that people do not maximize expected monetary value but rather expected utility \(U(x)\).

\[ E[U(X)] = \sum^\infty_{k=1} 2^{-k} U(2^k) \]

For the log utility case, \(U(x) = \log(x)\), the expected utility is \(2 \log(2)\). To find the certain dollar amount \(x^*\) (certainty equivalent) that provides the same utility:

\[ \log(x^*) = 2\log(2) = \log(2^2) = \log(4) \implies x^* = 4 \]

Under log utility, a rational player would pay at most $4 to play, despite the infinite expected monetary value.

Online: 2 Weeks (14 Days)

Email: vsokolov@gmu.edu

Phone: 703 993 4533

Course Textbook: Bayes, AI and Deep Learning by Nick Polson and Vadim Sokolov. The book is to be published by Chapman & Hall/CRC in 2026. Available for free online.

Topic Purpose

The purpose of this topic is to introduce participants to the foundational concepts of artificial intelligence and data-driven decision making. Participants will develop a working understanding of probability, statistical modeling, and modern AI techniques—equipping them to lead AI initiatives, evaluate AI investments, and communicate effectively with technical teams.

Topic Overview

This module takes executives on a journey from the fundamentals of probability and uncertainty through statistical modeling to the cutting edge of modern AI. Rather than focusing on mathematical derivations, we emphasize intuition, real-world applications, and business implications. Through compelling case studies—from wrongful convictions caused by probability errors to the Netflix Prize’s lessons about model complexity—participants will learn to think probabilistically about business decisions. The module culminates in a hands-on project where participants build an AI agent using Cursor IDE, directly experiencing how data, models, and AI agents work together to solve business problems.

Topic Objectives

Upon completion of this topic, you should understand and be able to:

  • Apply probabilistic thinking to business decisions under uncertainty
  • Recognize common probability fallacies (prosecutor’s fallacy, base rate neglect) and their business implications
  • Understand the trade-offs between model accuracy, complexity, and business value
  • Interpret regression models and explain their predictions to stakeholders
  • Evaluate when AI/ML solutions are appropriate versus traditional statistical approaches
  • Build a simple AI agent that combines data analysis with natural language interaction
  • Lead informed conversations with data science and AI teams

Course Approach

This topic combines asynchronous learning (recorded lectures, readings, discussion boards) with synchronous sessions (live Zoom calls) and hands-on practice. The approach emphasizes:

  • Case-based learning: Each concept is grounded in real-world examples—from legal cases to sports analytics to retail pricing
  • Business-first perspective: Technical concepts are always connected to business decisions and outcomes
  • Progressive building: Each module builds on the previous, culminating in an integrated final project
  • Peer learning: Discussion boards encourage sharing experiences and learning from diverse industry perspectives
  • Applied practice: The final project provides hands-on experience building an AI-powered analytics tool

Time Commitment

This topic will require approximately 10 hours of work to complete:

Activity Hours
Recorded Lectures (3 lectures × 30 min) 1.5
Live Zoom Sessions (3 sessions × 1 hr) 3.0
Reading 2
Discussion Boards 1.0
Final Project 2.0
Total 10.0

Schedule

Day Activities
Day 1-2 Module 1 lectures available; begin readings on probability and Bayes rule
Day 3 Discussion Board 1 opens
Day 4 Zoom Session 1: Kick-off + Cursor IDE Hands-on (1 hr)
Day 5-6 Module 2 lectures available; readings on statistics and regression
Day 7 Discussion Board 2 opens
Day 8 Zoom Session 2: Mid-point Check-in (1 hr)
Day 9-10 Module 3 lectures available; readings on NLP and AI agents
Day 11 Discussion Board 3 opens
Day 12-13 Final Project work time
Day 14 Zoom Session 3: Wrap-up + Final Project Presentations (1 hr)

Module 1: Probability as a Language of Uncertainty

Days 1-4

Recorded Lectures

Required Reading (from course textbook)

Chapter 1: Probability and Uncertainty

  • Opening sections through “Kolmogorov Axioms”
  • Section: “Conditional, Marginal and Joint Distributions”
  • Example: Salary-Happiness

Chapter 2: Bayes Rule

  • Section: “Intuition and Simple Examples”
  • Example: Sally Clark Case
  • Example: Nakamura’s Alleged Cheating

Chapter 4: Utility, Risk and Decisions

  • Section: “Expected Utility”
  • Examples: Saint Petersburg Paradox, Kelly Criterion, Ellsberg Paradox, Secretary Problem
  • Section: “Decision Trees” (including Medical Testing and Mudslide examples)

Supplemental Reading (online)

Discussion Board: Decision-Making Under Uncertainty

Opens Day 3 | Due Day 7

“Consider a strategic decision your organization recently faced (or is currently facing) involving uncertainty. Describe the decision and identify:

  1. What were the key uncertain factors?
  2. How was probability or likelihood assessed (formally or informally)?
  3. Reflecting on the Ellsberg paradox and Kelly criterion, how might a more systematic probabilistic approach have changed the decision-making process?

Respond to at least two peers’ posts with constructive suggestions.”

Zoom Session 1: Kick-off + Cursor Hands-on

Day 4 | 1 hour

  • Welcome and module overview (15 min)
  • Hands-on: Setting up Cursor IDE and using coding agents (30 min)
  • Q&A on probability concepts from Module 1 (15 min)

Preparation: Install Cursor IDE before the session (setup instructions)

Module 2: Statistics and Modeling

Days 5-8

Recorded Lectures

Required Reading (from course textbook)

Chapter 1: Probability and Uncertainty

  • Sections: “Normal Distribution,” “Poisson Distribution,” “Binomial Distribution”
  • Examples: Heights of Adults, Customer Arrivals, NFL Patriots Coin Toss

Chapter 3: Bayesian Learning

  • Section: “Poisson Model for Count Data”

Chapter 12: Linear Regression

  • Section: “Linear Regression” (opening)
  • Examples: Google vs S&P 500, Orange Juice

Chapter 13: Logistic Regression and GLMs

  • Sections: “Model Fitting,” “Confusion Matrix,” “ROC Curve”
  • Example: NBA point spread

Supplemental Reading (online)

Discussion Board: Predictive Models in Business

Opens Day 7 | Due Day 11 “The Netflix Prize awarded $1 million for a 10% improvement in recommendation accuracy, yet Netflix never fully implemented the winning algorithm—it was too complex and expensive to deploy, and by then, streaming had changed the business model entirely. See Why Even a Million Dollars Couldn’t Buy a Better Algorithm - Wired (Netflix Prize case study)

Reflecting on this case and the regression concepts from this module:

  1. Identify a business process in your organization where a predictive model could be applied. What decisions would it inform?
  2. What would happen if the model’s predictions were inaccurate 20% of the time? 40% of the time? How would this affect business outcomes and trust in the system?
  3. Discuss the trade-off: Is a highly accurate but complex/expensive model always better than a simpler, ‘good enough’ model? What factors would you consider when making this decision?

Respond to at least two peers’ posts, particularly focusing on whether you agree with their assessment of the accuracy-complexity trade-off.”

Zoom Session 2: Mid-point Check-in

Day 8 | 1 hour

  • Review of statistical modeling concepts (20 min)
  • Live demo: Building a simple regression model with Cursor (25 min)
  • Discussion of final project requirements (15 min)

Preparation: Complete Module 2 lectures and readings

Module 3: Modern AI

Days 9-14

Required Reading (from course textbook)

Chapter 24: Natural Language Processing

  • Sections: “Converting Words to Numbers (Embeddings),” “Word2Vec and Distributional Semantics”
  • Example: Word2Vec for War and Peace
  • Sections: “Attention Mechanisms,” “Transformer Architecture” (overview)

Chapter 26: AI Agents

  • Full chapter overview (agent architecture, tool use, planning, safety)

Supplemental Reading (online)

Discussion Board: AI Agents in the Enterprise

Opens Day 11 | Due Day 14

“AI agents are increasingly being deployed in business contexts. Describe a workflow or process in your organization that could potentially be automated or augmented by an AI agent. Address:

  1. What tasks would the agent perform?
  2. What data or tools would it need access to?
  3. What guardrails or human oversight would be necessary?
  4. What risks or concerns would need to be addressed before deployment?

Respond to at least two peers’ posts.”

Zoom Session 3: Wrap-up + Final Project Presentations

Day 14 | 1 hour

  • Brief Modern AI recap (10 min)
  • Final project presentations/demonstrations (35 min)
  • Course wrap-up and next steps for AI leadership (15 min)

Preparation: Complete final project; prepare 2-3 minute demonstration

Assignment: Final Project

Orange Juice Pricing Analytics Agent

This part (as ever other part of the module) is optional.

Business Problem: You are a pricing analyst at a retail chain. Management wants to optimize orange juice pricing and promotional strategies. Build an AI agent that can answer business questions about pricing decisions using historical sales data and a predictive model.

Dataset: Dominick’s Orange Juice Dataset

  • Weekly sales data for orange juice brands (Tropicana, Minute Maid, Dominick’s)
  • Variables: sales volume, price, advertising features, brand
  • ~28,000 observations across multiple stores

Model: Linear Regression with Interactions

  • Predict sales volume based on price, advertising, and brand
  • Capture how price sensitivity varies by brand

Your Agent Must Answer These Business Questions:

  1. “What is the predicted sales volume if we price Tropicana at $2.50 with no advertising?”
  2. “Which brand is most price-sensitive?”
  3. “Should we feature Minute Maid in this week’s ad circular? What’s the expected sales lift?”
  4. “What price should we set for Dominick’s brand to maximize revenue?”
  5. “Compare the price elasticity across the three brands.”

Deliverables:

  1. Python code files in Cursor IDE (using provided template)
  2. Working agent that answers the 5 business questions above
  3. 1-page summary: What did you learn about OJ pricing? What surprised you?
  4. 2-3 minute demo during Zoom Session 3

Evaluation Criteria:

  • Functionality: Agent loads data, builds model, and responds to queries
  • Business Relevance: Clear connection between model outputs and business decisions
  • Documentation: Clear explanation of approach and results

See Final Project Guide for detailed step-by-step instructions.